
Conference bulova::decw_jan-89_to_nov-90

Title:DECWINDOWS 26-JAN-89 to 29-NOV-90
Notice:See 1639.0 for VMS V5.3 kit; 2043.0 for 5.4 IFT kit
Moderator:STAR::VATNE
Created:Mon Oct 30 1989
Last Modified:Mon Dec 31 1990
Last Successful Update:Fri Jun 06 1997
Number of topics:3726
Total number of notes:19516

60.0. "SCS as transport for X?" by ANTPOL::PRUSS (Dr. Velocity) Sun Jan 29 1989 10:36

    We have the ongoing discussion on TCP/IP transport; how about another
    transport question.
    
    Could SCS be used as a transport for nodes within a cluster?  Is
    the SCS protocol amenable to being used in this fashion?  Is there
    any reason to believe it would offer better performance than DECnet?
    
    -fjp

T.R    Title    User    Personal Name    Date    Lines
60.1STAR::KLEINSORGEshockwave riderSun Jan 29 1989 14:5010
    
    Hmmm.  Which workstation has a CI?  The only one I know of that
    "could" use SCS effectively would be the VS8000, but it doesn't
    support a CI adapter for its BI.
    
    A more interesting idea might be using LAT as the transport, it's
    simple, small and fast.
    
    

60.2This window manager is confused.ANTPOL::PRUSSDr. VelocitySun Jan 29 1989 20:056
    I thought we used SCS on the Ethernet in an NI/MI Vc.  There aren't
    enough slots in a VS8000 for a CI, but that would be an interesting
    tangent!
    
    -fjp

60.3Not really!!SKRAM::SCHELLWorking it out...Sun Jan 29 1989 22:0316
>    
>    Hmmm.  Which workstation has a CI?  The only one I know of that
>    "could" use SCS effectively would be the VS8000, but it doesn't
>    support a CI adapter for its BI.
>    
>    A more interesting idea might be using LAT as the transport, it's
>    simple, small and fast.

	Whoa!!  SCS is not a CI only protocol.  SCS runs on LAVC's, using
	the Ethernet as a transport.

	I think the real question is if SCS is a better protocol than
	DECnet task-to-task???

Mark

60.4Forgive me, but I just finished reading all the TCP stuff...DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Mon Jan 30 1989 17:385
Oh great...you want us to "support" this too, or shall we just ship the image
for everyone to play with?

Burns

60.5MAXWIT::PRUSSDr. VelocityMon Jan 30 1989 18:3412
    What, you mean you have it working already and are holding out on
    us?! :-)
    
    Just an idle question for speculation, really.  But I kind of like
    the idea of sending stuff to a VAXstation 8000 from the
    VAX_THAT_IS_YET_TO_COME over the CI.  We know that SCS is much more
    efficient than DECnet FAL for file transfer.  I have no idea how it
    would compare to task-to-task for the X protocol.
    
    -fjp
    

60.6STAR::KLEINSORGEshockwave riderTue Jan 31 1989 00:0311
    
    My wife did a prototype of the "LAST" (a LAT derivative) driver that
    talked directly to the CI.  Don't remember the numbers offhand, it was
    *very*, *very* fast.
    
    And raw LAT is probably about as quick as you want over the ethernet
    (though DECnet isn't a slouch on the ethernet for just raw data
    communication according to her tests).
    
    

60.7Remember DECnet over CI?DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Tue Jan 31 1989 09:0810
Well, I can't say I know much about this, but remember when CI first appeared
a few years ago?  It was possible to run a DECnet circuit over the CI.  After
a while, though, it was determined that it was much more efficient to run DECnet
over the ether and SCS over the CI.  (Now of course we also have SCS running
over the ether as well).  This may prove nothing, except to show that there is
precedent for deciding that it was better to use ether than CI for one particular
class of communication protocol.

Burns

60.8LESLIE::LESLIEAndy Leslie, CSSE / VMSTue Jan 31 1989 13:003
    The reason it was inefficient and thus slow was that it still used
    DECnet!  Using native protocols would be much faster - and is!

60.9STAR::SNAMANSandy Snaman, VMS DevelopmentWed Feb 01 1989 11:1912
    Re .7:

    Regarding the old wisdom of running DECnet over the Ethernet rather than
    on the CI.  Some recent performance testing has shown that this has
    been a myth for some time.

    The advent of processors faster than a 780 made it possible to do
    substantially better using DECnet on the CI than on the Ethernet.

    


60.10KONING::KONINGNI1D @FN42eqWed Feb 01 1989 18:1213
I think the reason there is an NISCS isn't because it's faster (inherently)
than DECnet, but because it was the way the VAXclusters software could be
made to run on an NI.  So it's questionable whether that would be any
better than DECnet to talk X to workstations.

As for LAT, remember that LAT is a request-response asymmetric protocol
optimized for the character-at-a-time interactive exchanges of dumb terminals.
X uses a very different sort of data flow pattern (pipelined rather than
request-response) and is unlikely to run as well, let alone better, on LAT
than on DECnet.

	paul

60.11I doubt that an SCS transport on the Ethernet would be much different than DECnetSTAR::BECKPaul BeckWed Feb 01 1989 20:289
Paul K is correct in .10 as to the rationale for NISCS. There is relatively
little difference between well-optimized DECnet performance on the Ethernet
and equivalent performance using NISCS. The evidence for this is in the
performance figures of DFS, which uses DECnet, but which comes quite close
to LAVc performance on Ethernet. The numbers aren't identical, but then they're
not doing exactly the same things once they get off the wire. (Comparing LAVc
with DAP will not produce a favorable comparison for DAP, on the other hand.)


60.12Then why is so much effort being put into a DECwindows transport using LAT?IO::MCCARTNEYJames T. McCartney III - DTN 381-2244 ZK02-2/N24Fri Feb 03 1989 16:287
An un-announced product to come out of DSG is planning to use LAT to transport
the X-wire protocol. If this is not such a good idea, then what needs to be 
done to get them to change their implementation strategy?

James

60.13***sigh***KONING::KONINGNI1D @FN42eqFri Feb 03 1989 17:035
We who have been trying to change that approach have been wondering about
that as well.  So far nothing has worked.

	paul

60.14RAMBLR::MORONEYBetter to burn out than it is to rust...Fri Feb 03 1989 22:0111
I would suggest using a separate Ethernet protocol for DECwindows transport,
rather than trying to lay it on LAT or SCS.  This way the driver code, packet
formats, etc. can be optimized for the type of traffic expected.  I'd guess
that Windows on ethernet SCS would probably do OK, but on LAT would be poor
since, as mentioned, LAT is optimized more for single-character traffic.

'Windows seems to be a big enough part of DEC's future that it should deserve
its own Ethernet protocol.

-Mike

60.15MIPSBX::thomasThe Code WarriorSat Feb 04 1989 00:249
A good implementation of NSP serves quite nicely as a transport for the X
protocol.  Since the X protocol consistently generates bidirectional traffic
all data ACKs tend to be piggybacked.  Thus almost all the traffic tends
to be X packets with very little overhead.

Note: VMS DECwindows users may want to raise their workstation's pipeline
quota to 8K or more to allow DECnet-VMS to use delayed ACKs more frequently.


60.16maybe already being doneATLAST::BOUKNIGHTW. Jack BouknightSat Feb 04 1989 17:346
    re: .15, VMS DECwindows startup already checks for and SETs DECnet
    EXEC PIPELINE QUOTA to 10000.  I assume that's the parameter you
    were recommending be changed.
    
    Jack

60.17KONING::KONINGNI1D @FN42eqMon Feb 06 1989 12:2713
Re .14: just because there is a big market for something doesn't mean that
it should have a protocol of its own.  In fact, just the opposite is true:
by using standard protocols, you make the product even more attractive.

That's doubly true since, as was mentioned, X runs well over DECnet and
there is no reason to believe that it will run substantially better over
any other transport.  Besides, developing additional transports is 
expensive, counterstrategic, etc.  It prevents things from running over
wide area networks, gives a "we don't care about standards" message, and
so on.

	paul

60.18VISA::BIJAOUITomorrow Never KnowsTue Feb 07 1989 02:4220
>That's doubly true since, as was mentioned, X runs well over DECnet and
>there is no reason to believe that it will run substantially better over
>any other transport.  Besides, developing additional transports is 
    
    I'm feeling a bit doubtful about this statement. So far, we have had a
    number of problems using X over DECnet, links being lost and so on,
    and I'm sure a LAT (ethernet, whatever you want) based transport would
    be an excellent solution (especially for LAVc's).
    Internally, we are moving towards hidden areas to be able to connect 
    our VAXstations on the network. 
    DECnet phase V is too far away, and I believe we really need a LAT 
    (ethernet, whatever you want) transport. At least, something that 
    doesn't get stuck in a bottleneck that a L2 Router can be in such case.
    
    Anyway, have you ever gathered statistics (e.g. packet/sec) of DECnet 
    usage when using X across it ?
    
    
    Pierre.

60.19???PSW::WINALSKIPaul S. WinalskiTue Feb 07 1989 14:1326
RE: .-1

>    DECnet phase V is too far away, and I believe we really need a LAT 
>    (ethernet, whatever you want) transport. At least, something that 
>    doesn't get stuck in a bottleneck that a L2 Router can be in such case.

I don't understand this.  If your VAXstation is plugged into an ethernet,
the DECnet runs over it.  Stations on the same ethernet can talk directly
to each other without involving any routing node whatsoever, let alone an L2
router, if the stations are in the same area.

>    I'm feeling a bit doubtful about this statement. So far, we have had a
>    number of problems using X over DECnet, links being lost and so on,
>    and I'm sure a LAT (ethernet, whatever you want) based transport would
>    be an excellent solution (especially for LAVc's).

If you are talking about LAVc, then all of the nodes MUST be in the same
DECnet area, and they all must be on an ethernet.  DECnet works just fine
without involving any routing nodes in these circumstances, and it uses
ethernet.  Assuming that your ethernet hardware is configured properly, the
only case where you should be seeing logical links broken is when one or
the other machine goes down, and there is no preventing that.  I don't
understand your problem here.

--PSW

60.20STAR::KLEINSORGEToys 'R' UsTue Feb 07 1989 14:4821
    
    Paul, it's a common perception that LAT often works better than
    DECnet on the ethernet, especially if you've ever been on one
    of the segments in ZK.  I often have two machines next to each
    other that refuse to see each other, and CTERM over a couple of
    bridges in this building can be a hazard.  On the otherhand, I
    quit using SET HOST a long time ago because LAT proved so much
    more reliable (hence the perception) and often I get pissed
    when a copy tells me that my node isn't reachable when I'm
    VWSLATed from my node at the time I get the message.
    
    It may be that the difference is that DECnet is picky about
    making sure that the data actually gets there and gets there
    correctly, while LAT assumes everything is peachy and has much
    less error checking and looser "tolerances" (an error!? hey, let's
    send it again...).
    
    Anyway, to a typical "user", LAT usually looks more reliable.
    
    

60.21!!!VISA::BIJAOUITomorrow Never KnowsTue Feb 07 1989 14:5968
    Re: .19
    

>I don't understand this.  If your VAXstation is plugged into an ethernet,
>...
    
    No. Not if the VAXstation is in a hidden area (which is, for our case,
    area 63). In the area 51 (the regular one), we have an L2 router which
    talks to another L2 router which stands in area 63. The path for a
    packet from a satellite to the boot node (which stands in area 51,
    because we need access to the WAN) is then thru the two L2 routers.
    I believe you can get more info in the notesfile IAMOK::HIDDEN_AREAS.
    
    
>If you are talking about LAVc, then all of the nodes MUST be in the same
>DECnet area, and they all must be on an ethernet.  DECnet works just fine
    
    No, the nodes aren't in the same area, but they are on the same LAN.
    Although they are on the same LAN, they have to go through the two L2
    routers for *DECnet* communications (but not for LAT or SCS
    communication).
    
    
>ethernet.  Assuming that your ethernet hardware is configured properly, the
>only case where you should be seeing logical links broken is when one or
>the other machine goes down, and there is no preventing that.  I don't
    
    No, we have had cases where links were lost without having one node or
    the other being down. It's just that the DECwindows server just can't
    cope with the buffers (as I understood it).
    
    Note #293.0 in the notesfile HANNAH::DECW$DISK:[PUBLIC]DECTERM describes 
    the problem in more detail. I quote without permission some of the content of the
    note. You can go to the notesfile to get the exact context, for better
    accuracy.

    
>  Occasionally when the server has replies and events to write to a client 
>  and network output buffers are unavailable to perform the write operation, 
>  the current server would attempt the same write for a number of times
>  prior to disconnecting the non-responsive client.

>  In the duration of the retries, the server would not serve any other
>  client, and to the user, it would appear that the server is hung.

    As you can see the server will hang, but sometimes, I believe when
    time-out occurs, the server just gives up and drops everything on the
    floor.
    
    As a fix for the moment, we raised the Maximum buffer parameter in the
    boot node exec (from 100 to 200) and the pipeline quota. And wait and
    see.
    
    In our area (51), we should run out of numbers in a couple of months. 
    What will happen to the dozen VAXstations I have ordered?  Run them
    standalone, out of the network ? Naah, everybody needs the net, so we
    just got to squeeze our elbows, waiting for DECnet phase V that should
    (as I understood it) solve the limitation of 64 areas and 1023 nodes
    per area, and use the concept of hidden areas.
    
    There may be other concepts, but I ain't a specialist in this area,
    IAMOK::HIDDEN_AREAS covers more of the problem.
    
    
    (sigh) C'est la vie !
    
    Pierre.

60.22KONING::KONINGNI1D @FN42eqTue Feb 07 1989 16:3718
There are definitely some misunderstandings about DECnet going around here,
which isn't helping the signal to noise ratio.

It does NOT matter whether your areas are the same, different, hidden, or not.
If you're going from one endnode to another on the same Ethernet, then
traffic will go direct (after a few initial packets).  If the host is a
router, then things aren't always that efficient, but then again if you
run routing on your hosts things are slower anyway.

As for DECnet being flaky, there may be some resource allocation problems, 
bugs, or whatnot.  Certainly things can get bad when some of the routers
in the area are inadequate (e.g., 750s or worse).  There is nothing in the
architecture that makes DECnet any more or less reliable, as far as 
links staying up is concerned, than LAT.  Certainly there is no such
issue as "less error checking" or "looser tolerances".

	paul

60.23PSW::WINALSKIPaul S. WinalskiTue Feb 07 1989 16:5427
RE: .21

You are assigning the blame for lost client/server communication to the wrong
place.  The DECnet logical link remains intact--the problem is that the X
server is single-threaded and times out client applications on its own,
independent of the state of the DECnet logical link.  This is a bug in our
current X server implementation and is independent of the protocol used to
provide the client/server transport.  Switching to SCS or LAT would not solve
the problem--the X server would still run out of buffers and you'd still
be disconnected.  This is a problem that should be fixed where the problem
occurs--in the server.

As far as hidden areas go, we should not be making strategic product design
decisions (such as what protocols to use for X) on the basis of temporary
configuration problems on our own internal network.


RE: LAT

No question about it--LAT performs magnificently for what it was designed to do,
which is to package single-byte transmissions on multiple virtual circuits into
a single ethernet message between a terminal server and its client CPU.  It
is better than CTERM at this.  However, X is a message-passing protocol, and
I question whether LAT would work as well as DECnet or SCS.

--PSW

60.24VISA::BIJAOUITomorrow Never KnowsWed Feb 08 1989 03:1934
    Re: .22
    Well, believe it or not, our DECrouter2000's (which are the most powerful
    L2 routers at the moment, correct me if I am wrong) do see the packets we
    are sending from one workstation to the boot node. 
    As well, how should I consider the Appendix A, paragraph A.6, page
    A-16, of the Networking Manual ? Have they got it wrong ?
    
    
    Re:.23
    From my user's point of view, what I see is a *lost* DECnet link.
    Whether it's DECnet or a server or a client doesn't matter to me. The
    link is lost, my work is lost.
    I'm glad you've found the bug, I'm sure it will be fixed for a future
    release of DECwindows. If, on top of that, it suppresses the occasional 
    hangs I have on my VAXstation, then perfecto.
    
>As far as hidden areas go, we should not be making strategic product design
>decisions (such as what protocols to use for X) on the basis of temporary
>configuration problems on our own internal network.
    
    I definitely agree. But I didn't imagine that adding another transport
    to the set of DECwindows' transports could be a "strategic product
    design". 
    
    
    Nevertheless, I will ask again my question: 
    Has anybody ever measured the packet rate per second (for instance) over
    DECnet that DECwindows generates from a local application to a remote
    display ? Any statistics produced ? Any performance tests ?
    
    
    
    Pierre.

60.25Area 51 speakingCASEE::LACROIXNo futureWed Feb 08 1989 03:4023
    Re previous:

    I'm in area 51 too... We have lots of workstations on a private
    Ethernet segment, and we were running into this problem of DECnet links
    being dropped on the floor (yes, it could be the X server timing out of
    its own). Gurus in CASEE came up with a very successful hack a couple of
    months ago: basically, whenever a workstation was rebooted, NETACP on
    the boot member was paging like crazy, going through the entire net
    database, looking for info on the workstation. That, plus the MOM
    process and a too small working set for NETACP was causing *ALL* X
    connections between the boot member and other workstations to be
    aborted. The fix is to use an area number small enough to cut down on
    NETACP's paging rate: area 1. Our boot member now thinks our
    workstations are in area 1, and thus finds info on what it should do
    with our workstation turbo fast. No more paging, no more links dropped
    on the floor, no more 10-second cluster transitions, etc...

    Incidentally, folks we talk to in the States were not very receptive
    to the problems we were having; I suspect this is related to the fact
    that you have a smaller problem when all your satellites are in area 3.

    Denis.

60.26STAR::BRANDENBERGIntelligence - just a good party trick?Wed Feb 08 1989 09:5420
    re: various
    
    What PSW said about the location of the problem is absolutely correct. 
    The problem begins with a poorly designed protocol, is aggravated by
    the VMS interface to DECnet, was only partially corrected by the
    transport, and a last-chance keep-alive effort was made in the server. 
    There is work-in-progress to make future versions better.  What can you
    do now?  Use tcp/ip.  Yes, even for vms-to-vms connections.
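
    A minimal sketch of what that looks like from the client side, assuming
    the standard Xlib display-string convention applies (one colon selects
    the TCP/IP transport, two colons select DECnet); the host name "myhost"
    is only an example:

        #include <stdio.h>
        #include <X11/Xlib.h>

        int main(void)
        {
            /* "myhost::0.0" (two colons) would select the DECnet transport */
            Display *dpy = XOpenDisplay("myhost:0.0");   /* TCP/IP */

            if (dpy == NULL) {
                fprintf(stderr, "cannot open display\n");
                return 1;
            }
            XCloseDisplay(dpy);
            return 0;
        }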
    
    Lat?  It's being looked at but what paul suggested may be true.  A
    protocol may or may not save you.  Lat works nicely when there are many
    data sources mapped to many data sinks but what will happen when there
    is a *single* data sink (a server)?
    
    As for network load statistics, a test suite has been created and
    numbers have been collected.  A report is being written (I haven't seen
    it yet).  It should be interesting.
    
    						monty

60.27KONING::KONINGNI1D @FN42eqWed Feb 08 1989 11:198
Re the problem of NETACP taking so much time on downline load requests:
that certainly is a problem.  It has been known for years.  There are
various obvious solutions that haven't been implemented.  However, none
of that has ANYTHING to do with the issue of which transport is appropriate
for X.

	paul

60.28Technical reasons for protocol problems?WINERY::ROSEWed Feb 08 1989 14:456
    Re .26: "The problem begins with a poorly designed protocol..."        
    
    I realize this is kind of complicated, but could you please elaborate?
    (This is not an argument, but I am just very curious because when
    reading over the X protocol I did not see anything particularly wrong.) 

60.29You'd think they'd learn after a whilePRNSYS::LOMICKAJJeff LomickaThu Feb 09 1989 12:546
It seems like the modern equivalent of assuming all computer terminals
will operate at 38.4KB continuously without the use of xon/xoff...

Figures, considering the source.


60.30DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Thu Feb 09 1989 15:305
A couple of notes here were hidden pending a discussion among the moderators.
We got a complaint.

Burns

60.31Wait a minute...CIM::KAIRYSMichael KairysThu Feb 09 1989 15:4221
    I would like to complain in the reverse direction. I was fortunate to
    have read note .29 just minutes ago, prior to its being set hidden. I
    believe I can guess what prompted the impulse to hide it. 
    
    However, I think the note presented information and a point of view
    that is important and needs to be aired. I think .29 should be used to
    start a discussion about real-world requirements which may (and should)
    lead to those requirements being addressed. My area of concern is
    discrete manufacturing; perhaps not as "critical" in some senses as
    nuclear engineering but nonetheless an area which demands dependable
    delivery of information and needs windowing technology.
    
    Perhaps the note could be slightly edited, if someone insists, and 
    returned to view. Personally it didn't seem inflammatory to me, but I'm
    from Ann Arbor...
    
    BTW, I also think note .31 presents a point of view about the history
    of X that is worth (re?)stating. 
    
    -- A Concerned Citizen

60.32DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Thu Feb 09 1989 16:577
There was not an "impulse" to hide it.  Someone (not from VMS development, I
might add) was concerned about aspects other than inflammation.
Please let it go at that for the moment.  I did not say this was the final
word.  That is what "hide" is for as opposed to "delete".

Burns, unfortunately a moderator

60.33Odd that LAT, not SCS, is the main topic when LAT was in another noteCVG::PETTENGILLmulpThu Feb 09 1989 19:2529
re: .23

>No question about it--LAT performs magnificently for what it was designed to do,
>which is to package single-byte transmissions on multiple virtual circuits into
>a single ethernet message between a terminal server and its client CPU.

The above statement is about `1/3 true'.

Bruce Mann usually talks about his experience developing network applications
(based on DECnet) when talking about the goals he had for LAT.  He wanted a
fast (ie., low in network and CPU overhead), fast (just in case you missed it
before), simple (ie., something that didn't take an army of programmers and
managers), simple (ie., something that one person could do and that would be
implemented widely), LAN transport.  Most of the work that Bruce was doing was
realtime data acquisition, but terminal character echoing is best if it is in
realtime, so terminal I/O is very applicable.  LAT is NOT Local Area Terminal;
LAT is Local Area TRANSPORT.

LAT and SCS have a number of things in common (Bruce was involved in the
architecture of both):  They both multiple multiple sessions over a single
virtual circuit and they both plug into the applications in the kernel rather
in user mode.  While these points make interfacing them to the system more
difficult, there is usually a payoff in terms of performance.

LAT was always intended to be a multipurpose tool for supporting specialized
LAN applications.  X was intended to be a LAN application.  Depending on how
users use X, LAT+X may be a real winner.  If X replaces ASCII, as it does with
an X terminal, then the use will be right as far as I can tell.

60.34Regarding the ProtocolSTAR::BRANDENBERGIntelligence - just a good party trick?Fri Feb 10 1989 12:3693
    re .28:  Yes, you did, it's practically on page one but it's so huge,
    no one seems to notice.  Consider typical client/server operation:  the
    client sends asynchronous requests to the server while the server sends
    asynchronous events to the client (say resulting from mouse motion or
    window reconfigure).  Only occasionally do the server and client come
    together and synchronize their communications with a request/reply
    pair.
    
    What does this mean?  It means that the only thing that keeps a
    client/server connection running is the buffering capability of the
    underlying transport implementation.  A server in the throes of
    generating motion events or window reconfigure events will run through
    code that commits the server to sending events to at least one and
    sometimes many client connections.  When this happens, the buffering
    capacity had better become available soon or the server will wait
    until it does.  The way the user's data is buffered on Berkeley-style
    networking implementations, it often is available.  But, say with a
    record-oriented interface and quota scheme as with the VMS interface to
    DECnet, it would almost never be available without additional work by
    the application.  (This is one of the intended functions of the common
    transport image on VMS.)
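
    A minimal sketch (not the actual server code) of why a committed event
    write is dangerous; the write() here may stall the single-threaded
    server whenever the transport's buffering is exhausted, which is exactly
    the situation described above:

        #include <unistd.h>
        #include <errno.h>

        /* Send one 32-byte event on a *blocking* connection.  Every other
         * client waits while this loop waits. */
        int send_event_blocking(int fd, const char event[32])
        {
            size_t done = 0;
            while (done < 32) {
                ssize_t n = write(fd, event + done, 32 - done); /* may block */
                if (n < 0) {
                    if (errno == EINTR)
                        continue;
                    return -1;               /* connection is broken */
                }
                done += (size_t)n;
            }
            return 0;
        }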
    
    "Well, then it's a VMS problem, isn't it?"  No.  I'm the first to admit
    that the VMS interfaces are often inconvenient for getting work done
    but in this case, they merely exaggerated a problem with the protocol
    they did not create it.  About two years ago, after finishing one of
    the first ports of the server to VMS, we experienced frequent deadlocks
    due to this problem (I should say we experienced infrequent successful
    operation).  I poked around, looked at the system, looked at the design
    and said, "Look, this protocol is a deadlocking protocol."  I received
    very little indication that anyone understood the problem or that they
    anyone was interested.  At this point, in my opinion, we should have
    worked on the server semantics, or changed the protocol, or...
    something but it didn't happen.  
    
    "Well, um, in R3 they fixed xlib to keep reading from the server if it
    can't write requests."  Yes.  On Unix.  But is that enough?  Must the
    operating system provide the means of recovery from a bad protocol? 
    Should a "reliable, production-quality, bullet-proof" server rely on
    the good behaviour of its clients to ensure that it continues to
    execute?  Should it rely on the stability and predictability of a
    network populated with LAVC's, NFS-served systems, diskless systems,
    gateways, bridges, etc.?  Should it rely on some unknown operating
    system scheduling its clients so that it can continue operation?  These
    are the sorts of questions one must ask when designing a reliable,
    distributed system.  Answers are even better but I don't have any
    that are clear and absolute.  How about some scenarios?  Here are some
    possibilities which I can imagine (though they may not exist in fact).
    And yes, they're pathological but they are intended as illustrations
    to encourage discussion of the technology.

    1)  A standalone workstation whose user has a few xterms, a
    wmohc (window manager of his choice), a clock, etc.  He runs an X
    application which creates windows, does some work, and interrupts it
    leaving it around but not running.  He goes on to do other things like
    pop windows and drag his mouse around.  All of a sudden, his
    workstation hangs while the server tries to send some events to a
    client that isn't running.  How do you recover?

    2)  A workstation on a network has a client from a diskless workstation.
    The link gets a bit behind while the client tries to write some requests
    so it, being an R3 system, dutifully tries to read from the server.  But
    the code that reads takes a page fault and the NFS server has just crashed. 
    Three seconds later, the X server wants to tell this client about the
    180 motion events that have occurred and so it hangs.  All because of a
    nfs server *two hops away*.
    
    "But, I've been programming on X workstations for years and it's
    usually worked for me!"  Well, so what?  Is this proof by example? 
    Let's be honest with ourselves:  the primary use of X systems up to
    this point has been as programmer's workstations, to develop
    programmer's tools, all to help programmers.  Only now is it moving out
    into non-programming and non-engineering tasks.  I hope I'm not
    bursting anyone's bubble with this proposition but, in my opinion, the
    standards of reliabilty and quality to which programmers in the world
    at large hold themselves *do not* compare favoribly with those in most
    other engineering and non-engineering activities.  By analogy,
    programming is to, say, civil engineering what astrology is to
    astronomy or what numerology is to mathematics.  Consider:  a power
    company might investigate using a workstation to display the operating
    status of a fission reactor.  Or medical equipment companies who'll
    make instruments to monitor patients in surgery.  Or manufacturers
    who desire to control time- and position-critical processes in a
    steel mill.  When one builds a skyscraper, it is anchored in bedrock
    not in mud.  I believe that this good-enough-for-programmers-so-it's-
    good-enough-for-everybody attitude is *unacceptable* when the products
    of these programmers are actually used by the rest of the world.
    
    In taking this opportunity for a little bombastic opinion, I hope I was
    able to adequately describe the protocol deficiency as I understand it.
    
    					monty

60.35re: .30STAR::BRANDENBERGIntelligence - just a good party trick?Fri Feb 10 1989 12:3714
    (I've been stewing for two years but I'm feeling better now.)
    
    Yes, I'm not too happy with the design but I can be fair.  The "Boys
    From Cambridge" didn't set out to solve the world's display problems
    so many years ago (at least by my understanding).  They created a
    system that was built for programmers and students and it may be
    adequate for that purpose.  I certainly like to use the tools and the
    environment for my work (programming).  But first by accident and
    then by *executive decree*, it was decided to make a commercial system
    out of this that would solve everybody's needs.  I am personally
    uncomfortable with the way in which these decisions were made.
    
    					monty

60.36Well, look at where the market is putting its moneyPOOL::HALLYBThe smart money was on GoliathFri Feb 10 1989 16:2912
    Nor is this the first example of the marketplace demanding an inferior
    product.  Your PC (Apple or IBM) crashes?  Oh, well, reboot it and get
    on with things.
    
    These kinds of problems are seen to be like cars stalling then starting.
    No big deal, it costs too much to engineer perfection.
    
    Nuclear reactor operations?  We'll buy two.  They won't both fail just
    prior to meltdown.  Etc.
    
      John

60.37Slight time warp (Old noters: remember those?)DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Fri Feb 10 1989 16:483
For the record, .36 and .37  replace some notes which were deleted (29 and 31,
I think.)  That is why the context and order seem a bit funny.

60.38The inevitable follow-up questionsWINERY::ROSEFri Feb 10 1989 19:1315
    RE .36: Thank you, this is very interesting. Disclaimer: These are
    questions -- not arguments. I'm trying to understand your note, not
    rebut it. 
    
    Are you contending the following? It is impossible to write a server
    that does not hang if a client hangs and if enough events occur that
    are directed to that client. 
    
    You say this is even true over TCP/IP, just that the probability of
    hanging is much lower under TCP/IP on Ultrix? Is that because TCP/IP on
    Ultrix allows more stuff to be in the pipeline undelivered? 
    
    An even more general question: What is the simplest change to X that
    would make it possible to write a hang-free server? 

60.39The Ultrix server seems hang-proofFLUME::dikeSun Feb 12 1989 11:3810
I checked the Ultrix sources, and it doesn't look like the server is capable of
hanging.  The server sets up connections so that if a read or a write would
block, the call returns immediately with EWOULDBLOCK.  If the call was a read,
the server services other clients until the rest of the data comes through.  If
it was a write, the client is punted.
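
A rough sketch of that behaviour (not the Ultrix server source): the connection
is made non-blocking, a read that would block simply returns, and a write that
would block causes the client to be punted; punt_client() is a made-up helper,
not a real routine:

    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>

    extern void punt_client(int fd);   /* hypothetical: close and clean up */

    void make_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        (void)fcntl(fd, F_SETFL, flags | O_NONBLOCK);  /* FNDELAY on old BSD */
    }

    int write_or_punt(int fd, const void *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);
        if (n < 0 && errno == EWOULDBLOCK) {
            punt_client(fd);           /* don't wait; drop the slow client */
            return -1;
        }
        return (int)n;
    }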

I don't intend to claim that anecdotal evidence amounts to proof, but I have
never heard of an X server on Ultrix hanging in a read or a write.
				Jeff

60.40The problem is not in managing the line but the resources consumed by the server.IO::MCCARTNEYJames T. McCartney III - DTN 381-2244 ZK02-2/N24Sun Feb 12 1989 18:1823
RE.: .41

Consider an application that enables mouse events, then promptly goes off to
"sleep" (ignores making a call to get the next event). The process may actually
be doing something useful (like an FFT or Finite Element model). Meanwhile, the
impatient user is idly dragging the mouse around generating 1000's of events
per minute. The server, attempting to preserve these events, is packaging them
up as quickly as it can, shipping them out to the client. Eventually, the client's
network buffer fills, the network transport layer screams "No more..." and the
server has to decide to buffer it locally, or to drop things on the floor. 

Early servers attempted to do no buffering and simply aborted the link, causing
intrinsic reliability problems. I can't speak for the existing VMS and Ultrix
servers, having not seen the code, but I believe that this is one of the
problems to which Monty is referring.

In extremely severe cases, it is possible that the server will exhaust its 
resources trying to buffer events locally, and thus hang. Until the dormant 
program gets around to reading its event queue, nothing can be done on the
server. 

James

60.41%DECW-F-IPI-Insufficient programmer intelligence failure at ...IAGO::SCHOELLERWho's on first?Mon Feb 13 1989 10:169
re: .42

That is why we have been frequently reminded to not write programs that
disappear for a long time without checking the event queue.  A small amount
of intelligence on the part of the application developer prevents this
client from being punted.

Dick

60.42Look at what has come beforeSTAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 13 1989 11:3071
    
    re .40:  Is it possible to write a hang-free server?  If it is not
    acceptable to drop a connection at the first sign of a hang, then
    I believe it is impossible to write a *reliable*, hang-free server.
    In previous replies (to which I will respond shortly) note how recovery
    takes place:  if a server write blocks, drop the client.  Most
    low-level networking protocols implement some sort of quota system
    (windows or debit/credit or ... ) in the protocol itself.  The X
    protocol implements it in the operating system interface (if it doesn't
    fit, kill the connection).  This is one thing that must change if we
    are to have a reliable server.  There are at least two ways that this
    can happen:  either by changing the protocol and server semantics to
    include a debit/credit system for server-to-client communication or
    by changing them to allow unreliable delivery of events.
    
    I'll consider the latter first.  In certain areas, the X server has
    already made some movement in this direction.  With the realization of
    how large a load can be generated by mouse motion events, the designers
    created a "motion history buffer" in the server.  If we're generating
    events too quickly, and the client allows, put motion events in this
    buffer and report to the client, via events, that there is something
    interesting in the motion history buffer.  While this implementation is
    along the lines of an infinite buffer approach, look at what they're
    really doing:
    
    	1.  Server attempts to send report and fails (or might fail).
    	2.  Server stores state change (mouse motion).
    	3.  Server reports to client availability of state change
    		(motionHints is non-zero or whatever).
    	4.  Client synchronously requests report of state change
    		(getMotionHistoryBuffer).
    
    Generalizing this and changing the implementation, would give a server
    that doesn't *insist* on sending every single event and a chance at a
    reliable, hang-free server.
    
    Or, how about a debit/credit system?  Xlib could piggyback event credit
    values on requests.  An initial maximum could be inferred from the
    networking quotas and that particular networking interface.  This still
    implies a server that isn't required to send events or one that is able
    to encapsulate state changes to be sent later.  Another way is to
    change the communication model to something along the lines of an RPC.
    Asynchronous client requests could still be asynchronous but server
    state changes (i.e. events) would be acquired mostly synchronously. 
    There might still be an event credit to retain interactivity if it is
    shown to be necessary but Xlib calls such as XNextEvent would become
    request/reply pairs.  These are just some ideas, nothing has been tried
    but I think these are interesting avenues to pursue.
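
    A very small sketch to make the debit/credit idea concrete; the credit
    field, transport_send() and the per-client record are all assumptions,
    nothing like this exists in today's protocol:

        typedef struct {
            int      fd;
            unsigned event_credit;   /* events we are still allowed to send */
            int      state_pending;  /* set when an event had to be withheld */
        } ClientRec;

        extern void transport_send(int fd, const char *buf, unsigned len);
                                         /* hypothetical non-blocking send */

        /* A request arrived carrying a piggybacked credit value. */
        void grant_credit(ClientRec *c, unsigned credit)
        {
            c->event_credit += credit;
        }

        /* The server wants to report a state change to this client. */
        void post_event(ClientRec *c, const char event[32])
        {
            if (c->event_credit > 0) {
                c->event_credit--;
                transport_send(c->fd, event, 32);
            } else {
                c->state_pending = 1;  /* client fetches the state later,
                                          synchronously, as with motion hints */
            }
        }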
    
    As for the tcp/ip vs decnet and ultrix vs vms issues...  it's more a
    matter of programming interface than either base protocol or host
    operating system.  VMS tcp/ip (connection), Ultrix tcp/ip, and Ultrix
    DECnet all work pretty much the same:  they're byte-streamed,
    socket-derived interfaces that buffer user data in an almost pure byte
    limited fashion  (I believe, 4K per direction and per side is the
    default in all the above mentioned implementations.)  VMS DECnet, on
    the other hand, has a record-like quota system based on segments with a
    $QIO more-or-less generating at least one segment.  The byte-stream
    model allows a client and server to run skewed which is to some degree
    a requirement in *any* distributed system.  (Imagine trying to pipe
    some shell commands together if the byte-quota on a pipe was, say, one
    byte.)  Unfortunately, this model is also tolerant of protocol design
    failures.  Architectures which are intrinsically deadlocking appear to
    work simply because the deadlock condition is unlikely and the allowed
    response to a deadlock, if the interface allows it to be sensed, is to
    give up.
    
    Just some thoughts...
    
    						monty

60.43Oh, yes, an experiment.STAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 13 1989 11:3816
    
    re .40:  I've had an idea for an experiment for some time but I can't
    get the resources to perform it.  The idea was to get two ultrix
    machines on their own ethernet and setup an X test environment that
    would allow me to create an arbitrary cpu load on either a server or
    client machine.  I would then vary two variables, the load on a
    system (either server or client) and the mbuf quota for links, and
    observe and measure the reliability of various interactive
    applications.  My belief is that connections will become markedly
    unreliable as quota is dropped.  My contention is that there is no
    threshold at which a connection becomes reliable; that there is only a
    curve giving probability of failure which is never zero and which is a
    function of so many variables that we can never say "you're safe."
    
    					monty

60.44Hang-proof isn't the same as reliableSTAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 13 1989 11:488
    
    re .41:  You are absolutely correct in the ultrix case.  But first,
    only Unix has the nice FNDELAY option and must this be used to
    implement the protocol and server semantics?  And second, it doesn't
    hang but is it reliable?  Can't Joe Customer have both?
    
    					m

60.45STAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 13 1989 12:1636
    
    Re .43:  This is an extremely poor attitude to take.  I've already
    complained that X must rely on networking implementations to survive
    (an inappropriate mixing of levels) and now you're suggesting that the
    remaining slop be taken care of by the application programmer.  By the
    goodness in our hearts, we'll make this work?
    
    Truthfully, what justification is there for a call to
    ProcessInputEvents() in the outer loop of a 2D FFT?  Or an image
    convolution?  Or a large, atomic database transaction?  Or any of the
    other things that makes money for our customers?  I could argue from
    aesthetics (it's ugly), or structured programming paradigm (it's mixing
    levels), or from performance (it ruined the register optimizations), or
    from programmer convenience (they have to do everything), or from a
    quality assurance standpoint (more and more testing just to see if they
    can keep X alive).  And I claim it still isn't enough.
    
    The server can't control the application environment.  The application
    may be on another machine, on another operating system, in another
    country.  Well, neither can an application programmer completely
    control the application environment.  The programmer can't control when
    his process will be scheduled, can't control taking a page fault served
    by a crashed nfs server, can't control slow or overloaded or unreliable
    networks, etc. etc. etc.  The application programmer tries to get his
    algorithms correct and relies on the correctness of the system software
    to get the rest done.  Is the programmer's trust well placed?
    
    We are trying to create a reliable, distributed, interactive, graphical
    system.  (Those four adjectives are *very* important.)  I believe this
    is the single hardest networking problem anyone has yet seen.  It's
    more difficult than the base networking support (tcp/ip, udp/ip,
    decnet, whatever), rpc's, remote terminals, distributed filesystems,
    naming services, etc.  And I think it is not yet solved.
    
    						monty

60.46The future's not bright so take off your shadesSTAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 13 1989 12:2514
    
    Those who can begin to see the stochastic nature of these systems might
    think about the future.  The range of networking speeds is increasing. 
    Some people insist on serial line interfaces to X while others are
    preparing for FDDI and HSC.  The range of CPU speeds is increasing. 
    Two years ago, everything was pretty much one- to three-mips.  Servers,
    clients, pc's, routers, hp handheld calculators, etc.  Now we'll have
    Cray's, Connection Machines, Multiflow's, DAP's, MIPS boxes, SMP vaxes,
    on down to 68000-based X terminals.  This reliability curve I mentioned
    (really a reliability manifold) is dependent upon all these variables
    and others.  What is it going to look like in the future?
    
    					monty

60.47KONING::KONINGNI1D @FN42eqMon Feb 13 1989 12:335
Note that many of these would be non-problems if the operating systems we
use had decent multithreading facilities built-in.

	paul

60.48You've just moved the deadlockSTAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 13 1989 12:5217
    
    Re .49:  Do you mean for use by the server, one thread per connection? 
    If so, I think not (though others in VMS think it would be wonderful). 
    The problem is that clients intentionally and necessarily interact with
    one another.  They share real estate, keyboards, colormaps, etc. and
    when one client changes these, the others may need a report. 
    XSendEvent, properties, and selection encourage communication between
    clients.  And, because all these resources are shared, the database
    which maintains them is also shared.  And then there are clients which
    require atomicity across multiple X operations (such as the window
    manager) hence locking out other threads.  All this communication
    between clients implies locking, if a client needs a lock held by
    another client who is blocked by transport, that client will also
    block.  Conclusion:  server deadlocks can still occur.
    
    					monty

60.49Insufficient ArchitectureEVETPU::TANNENBAUMTPU DeveloperMon Feb 13 1989 13:2116
    Re: .43
    
    Yup, DECwindows requires that an application frequently check the input
    queue.  TPU had to jump through hoops to implement this.  And it's
    still not right.  I recently found that TPU's not checking the input
    queue while a subprocess is running (so don't do anything large in a
    subprocess and then wiggle your mouse on a DECwindows EVE window).
    
    How many other places have we missed, simply because no one considered
    yet another obscure area of the code?
    
    It would be a *LOT* easier if this was handled once, correctly, instead
    of trying to duplicate it in every application.
    
    	- Barry

60.50?WJG::GUINEAUMon Feb 13 1989 15:5813
Funny, my first use of X (DECwindows) was for an application that would
go off for more than 1 hour as a result of one mouse click. While it
was gone, the interface was dead! After a few contortions and mucho help
from this notes file, I got it all working by spreading ProcessXQueue();
calls all around the "work routine".
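
A minimal sketch of that workaround, assuming the work can be sliced into
passes; XPending() and XNextEvent() are standard Xlib, while do_one_pass()
and handle_event() just stand in for the application's own code (ProcessXQueue
presumably wraps a loop like the inner one here):

    #include <X11/Xlib.h>

    extern void do_one_pass(int pass);      /* one slice of the long job */
    extern void handle_event(XEvent *ev);   /* application event handling */

    void long_computation(Display *dpy, int passes)
    {
        int pass;
        for (pass = 0; pass < passes; pass++) {
            do_one_pass(pass);

            /* drain the queue so the server never backs up behind us */
            while (XPending(dpy) > 0) {
                XEvent ev;
                XNextEvent(dpy, &ev);
                handle_event(&ev);
            }
        }
    }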

I never suspected the far reaching implications this really had, but
figured there must be a better way (like have a separate thread do the X Queue 
Processing asynchronous to the rest of the application.)

John

60.51KONING::KONINGNI1D @FN42eqMon Feb 13 1989 17:5314
Right.  I was referring to the application, not the server.

On the server side, there has to be a better way too.  For example, events
could be discarded when there are too many pending transmission to a particular
client.  Such flow control would of course have to be on a per-client basis.
Then when the flow starts again, the client would receive a "you just lost
some events because you were too slow" event along with the subset of real
events that was kept.  (You may recognize this approach -- it's the one used
in DNA for event logging.)  It may or may not be appropriate for the server
to provide some feedback to the user (bell, or some such?) in addition to
the events-lost event that goes to the client.
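
A minimal sketch of that per-client policy; the limit, the queue routines and
the "events lost" notification are assumptions for illustration only:

    #define MAX_PENDING 64

    typedef struct {
        int fd;
        int pending;    /* events queued for this client, not yet written */
        int lost;       /* nonzero once something has been discarded */
    } Client;

    extern void enqueue_event(Client *c, const char ev[32]);   /* hypothetical */
    extern void enqueue_events_lost(Client *c);                /* hypothetical */

    /* Called whenever the server has an event for this client. */
    void report_event(Client *c, const char ev[32])
    {
        if (c->pending >= MAX_PENDING) {
            c->lost = 1;               /* discard rather than block the server */
            return;
        }
        enqueue_event(c, ev);
        c->pending++;
    }

    /* Called when the transport drains and flow starts again. */
    void flow_resumed(Client *c)
    {
        if (c->lost) {
            enqueue_events_lost(c);    /* "you were too slow" event */
            c->lost = 0;
        }
    }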

	paul

60.52Thought about that, too.STAR::BRANDENBERGIntelligence - just a good party trick?Tue Feb 14 1989 10:138
    
    We argued the possibility of an "events lost" event but the problem is
    with recovering the state change information from the server.  These
    changes are quite complicated and must be retained in some form for a
    client to keep it's environment in order.
    
    					m

60.53So, how about some feedback?STAR::BRANDENBERGIntelligence - just a good party trick?Tue Feb 14 1989 10:362
    

60.54DECW-F-NONMODULAR Program author not aware of DECwindows in 1967.IO::MCCARTNEYJames T. McCartney III - DTN 381-2244 ZK02-2/N24Tue Feb 14 1989 16:1379
RE: .43

I don't suppose that you are suggesting that we call the authors of packages 
like IMSL, SPSS, STRUDL, CHEATAH etc. and inform them that their carefully 
optimized matrix operations take too long. When we tell them that they should 
break up their routines for DECwindows applications (because we're incapable of 
building a robust server that avoids such complications), their reaction will be
the same as mine - laugh and go find a hardware vendor that builds computers, not
toys. If they had wanted a toy they would have called MATTEL.

Seriously, if we can't solve the problem of flow control on the X event queues
and come up with a realistic interpretation of what to do when the transport
becomes clogged, we will have some very unhappy customers. Some of their sources
have been in existence since the middle 60's and the programmers that wrote the
codes may have actually retired! Cracking open all these dusty decks simply 
because DECwindows comes along is not a good reason. (This assumes that one 
callously disregarding the modularity concerns is a viable option. Since 
we've heard over and over from these vendors: "Give us faster hardware, better 
and more interactive interfaces, but don't make us rewrite our codes.", we know
it's not!)

RE: .55

Feedback: Complete agreement with ideas expressed so far. The only thing that
still needs some discussion is what to do about the "lost events" event.

I see the problem with the need to keep the application and the server in sync,
but the hang (or hang-up) solution is definitely not adequate. If an application
was to get a "lost events" event, would it not be safe for the application to 
assume that it should initiate its own recovery mechanism? For instance, unmap
all windows and remap to restore "correct" appearances?

How does discarding input events cause problems? Applications already know how
to tolerate typeahead buffer overrun. Simply dropping mouse or keyboard events
that cannot be buffered should be sufficient. This behaviour is (I believe) 
consistent with existing experience and provides a system that will degrade
with dignity. 

Some special feedback mechanisms need to be provided by the server to ensure
that this overrun condition is quickly detected by the human operator. I believe
there are only three different mechanisms that must be provided: keyboard event
loss, locator motion event loss, and locator button event loss. For keyboard
event loss, simply ringing the bell a la the terminal driver is sufficient. This
same mechanism may also be useful for mouse button event loss. The difficulty
is to find reasonable feedback for the locator motion event loss.

For locator motion, we want to preserve the ability to move to another
application and continue work there, after all, concurrency is one of the good
things that workstations provide. Also the application we might be moving to is
our "hot backup" of the session that has encountered overrun problems. Given that
you accept these design parameters, we obviously cannot just ignore locator 
motion input. We must also track the cursor location on the screen accurately, 
so we can't just refuse to update the cursor. This leaves only two variables,
shape and color. Perhaps we can define a cursor shape or color which can be 
interpreted as "locator events being discarded" Perhaps the locator cursor could
alternate between two different shapes in this (abnormal) case. I don't know 
what the best answer is for this problem - comments?
 
As to what an application should do for lost events, we can easily answer these
questions. If the keyboard events are discarded, it will be as if the user never
struck the key. The application will be unaware of the lost events. For locator
button events, especially timing sensitive double and triple clicks, the lost
events will not be in the data stream but the "lost events" event will be. The
application can take action based on this new event type - usually to ignore 
any partially completed operation. For locator motion, applications already have
to be able to process non-linear motion since the tablet reports position and 
not relative information. 

I admit that accurate locator button tracking is difficult, especially since 
there are timing windows which can cause a lot of pain. For instance consider 
the problem of what happens when you are in a marginal network condition, have
down clicked to make a pull-down selection, started moving the mouse, buffer
overrun occurs, you continue to move the mouse (discarding events), buffer
overrun clears, and you release the button. Unless the application is careful,
this situation can lead to disastrous results. 

Comments?

60.55KONING::KONINGNI1D @FN42eqTue Feb 14 1989 17:348
Clearly the crudest possible response for an application that receives an
"events lost" would be to give up.  That would make it no more crude than
the present approach.  Of course applications can do better; how much better
depends on the application, the skill of the designer, etc.  I'd certainly
go along with the comments in the preceding response.

	paul

60.56PSW::WINALSKIPaul S. WinalskiTue Feb 14 1989 17:569
I like the idea of an "events lost" event.  The author of an application knows
which events the application has elected to receive.  The application is in
the best position to determine whether the loss of events is recoverable or
not--right now, it is the server that decides (and it always decides that an
event loss is unrecoverable).  My educated guess is that the vast majority of
"events lost" events would indeed be recoverable by the application.

--PSW

60.57MYVAX::ANDERSONDave A.Tue Feb 14 1989 18:167
    To make the decision easier for the application, report what type of
    events were lost (keyboard input, mouse motion, mouse button, etc?).
    This requires maintaining only a negligible amount of additional state
    information.
    
    	Dave 

60.58More ideasDECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Tue Feb 14 1989 18:1919
    For "events lost" we should probably allow the client to say something
    about what events he can tolerate losing (a hint, presumably), and
    also, the "events lost" message should probably tell something about
    the nature of the events.  For example, if the client knew that the
    messages lost included mouse motion and expose, it could completely
    repaint itself and Query the mouse position.
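
    A minimal sketch of what that client recovery might look like, assuming
    such an "events lost" event existed; XClearArea() with exposures set
    forces Expose events for the whole window, and XQueryPointer() recovers
    the pointer position (both are existing Xlib calls; only the lost-events
    event itself is hypothetical):

        #include <X11/Xlib.h>

        void recover_from_lost_events(Display *dpy, Window win)
        {
            Window root, child;
            int root_x, root_y, win_x, win_y;
            unsigned int mask;

            /* width/height of 0 mean "the whole window"; True asks the
             * server to generate Expose events so we repaint everything */
            XClearArea(dpy, win, 0, 0, 0, 0, True);

            /* resynchronize our idea of where the mouse is */
            (void)XQueryPointer(dpy, win, &root, &child,
                                &root_x, &root_y, &win_x, &win_y, &mask);
        }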
    
    BTW, there is a conference discussing X protocol change proposals. 
    It's not very active, but maybe it should be.
    
    BTW2, I would like to hear some more discussion of why/why not this is
    a problem on TCP.  If I were to lobby for something like this, I would
    need to make good arguments to Unix people.  (Don't take that to mean
    that good-ole-Burns will get this little protocol thing solved for the
    next version.  This would take more than a little deep thought,
    argument, and lobbying)
    
    Burns

60.59It applies to EVERY transportKONING::KONINGNI1D @FN42eqWed Feb 15 1989 12:2230
The problem is clearly independent of transport.  It applies equally well
to TCP/IP, to the local transport, and so on.  

After all, the problem isn't really the transport at all.  The problem is
application level flow control: the possibility that the server is generating
data (events) faster than the client is accepting them.  As things stand,
the application layer flow control is mapped into transport layer flow control,
since the client stops issuing Transport receive requests, which eventually
blocks Transport send requests at the server.  So the server application
ends up with data that it can't send.

It's usually well understood that distributed applications require flow
control to bound the size of the queues.  There are a couple of possibilities
in the general case:
	1. Design the receiver such that it is guaranteed to run at least
	   as fast as the sender.
	2. Have the sender stop generating new data when the queue is too large.
	3. Discard data when the queue is too large.

X does none of these; it uses the "off with his head" approach.  Given the
properties of X, #2 is not possible (event generation is controlled by the
user at the keyboard/mouse, not by the server alone).  #1 is also not
practical, so that leaves #3.

Note that I didn't mention DECnet anywhere in this discussion; it's all
transport-independent.  (Or you might say that the whole discussion was
in the application layer, not the transport layer.)

	paul

60.60I think "KISS" is the necessary magicBORA::MARTIBeat Marti - ISV Support - MR4-1/H19 - 297-3074Wed Feb 15 1989 13:5018
The problem is not one of transport. It should also not be left up to the
application (or application programmer sprinkling silly event queue flushes
all over the code) to solve the problem. It seems to me, that the only place
where we can think about some reasonable solutions is right where the problem
occurs - at the server.

I don't see anything wrong in stealing the idea from the terminal handlers
which simply ring a bell when the buffer overflows. How about if the server
would simply freeze the pointer, or better yet - change the shape of the pointer
similar to the wristwatch (wait) cursor - within the windows of the application which
is to receive the events which are going to be dropped. In addition, make
sure that any mouse clicks, keyboard inputs or such actions directed to that
application result in some easily identifiable response, maybe something
like ringing the bell.

I don't know how complicated it would be for the server to implement such
functions - but the concept definitely seems simple enough....

60.61LostEventSTAR::BRANDENBERGIntelligence - just a good party trick?Wed Feb 15 1989 14:41119
    
    re .61:  Beautifully stated.  The mapping of application flow control
    onto transport flow control is precisely the problem.  Transport
    implementations have different flow control and so X appears to operate
    differently on different transports.  However, the fault lies with the
    protocol design and server semantics.
    
    (An aside:  it is truly a pleasure to be talking to people other than
    myself for once.  Thanks to all for your participation.)
    
    I'll conclude from the replies that true reliability and not probable
    reliability should be a goal for a DECwindows server.  I concur with
    Paul Koning's conclusion that this implies that the server may drop
    data it attempts to send to a client.  There are two types of data
    which a server may send to a client:  replies and events (errors are
    encoded as events).  What is the "best" way to handle each type?
    
    I'll consider replies first.  A reply is generated in response to some
    client request and so there is some indication that the client will try
    to cooperate with the server.  But what if sending some reply should
    block?  I see three possible responses:
    
    	1.  Drop the connection at the first sign of congestion.  This
    	    certainly guarantees that the server never hangs but it isn't
    	    really reliability.  The protocol will in theory allow replies
    	    to be as large as 16GB.  How well the client is able to sink
    	    the reply data from the server will depend upon what is being
    	    sent, how well the network is operating, whether page faults
    	    are being serviced, the relative speeds of the client and server
    	    machines, how the client is being scheduled, and a host of
    	    other factors.  The client may be trying to read the reply but
    	    the program environment, which it can't control, may not allow
    	    the client to keep up with the server.
    
    	2.  Guaranteed transmission of replies.  This will ensure that any
    	    "best effort" client will receive a reply but now the server
    	    will hang until a reply can be buffered by transport.  I've
    	    already given examples where this time may be unacceptably 
    	    large.
    
    	3.  Best effort attempt.  How long can we allow a server to hang
    	    in an attempt to transmit solicited data to a client?  Decide
    	    this and use it as a timeout on reply transmission.  This
    	    doesn't give 100% reliability *but* we now can quantify the
    	    amount of time a client can take to read a reply if it wants
    	    to retain its connection.  I prefer this choice.
    
    Onward to events.  What is done here will have far-reaching
    implications on the whole decwindows engineering effort.  There are
    some applications whose compute tasks are so large relative to the user
    interface component that they won't mind rebuilding the interface
    should something be lost.  Others will be mostly user interface and
    will want to do as little as possible to recover from a gap in event
    transmission while still being reliable.  The most interesting of this
    latter type, I believe, is the toolkit itself.
    
    With that in mind and a predilection towards the "lost events" event
    and some experience with the intransigency of those who control the
    protocol, I'll consider that possibility.  Review the protocol manual
    and read the x.h and xlib.h files to see what kinds of events are being
    generated and what they cause.  It was suggested that mouse motion
    events are the primary candidates for encountering congestion, but they
    are not the only ones.  Mouse motion can also generate Enter/LeaveNotify
    events for windows up and down the hierarchy.  The offending mouse motions
    could have been part of a button-down sequence, so that not only are
    mouse motions lost but also that all-important button-up event.  Add
    grab/ungrab events to this mess, also.  Then, there are the
    ConfigureNotify and Expose-type events.  These are usually caused by
    another client (the window manager) and failure to respond to these
    will certainly cause ugly holes.  Also, some of these events are
    "counted" events.  I.e., they contain fields which count down the
    number of events which an application may *reliably* expect but which
    may be lost in the new system.  Then, there is the Brave New World of
    events defined by unimagined extensions.  The amount of state that
    needs to be kept and transmitted to the client isn't that small. 
    Basically, an application must be able to enquire as to the exact state
    of the server or at least be able to return it to a known
    configuration.
    
    Without thinking too hard, I'll take a stab at such an event and what
    kind of support it will require.  It is probably interesting to know
    what type of events were lost.  There are 128 possible events (256 if
    one wishes to distinguish "natural" events and those sent with
    XSendEvent).  128 bits requires four longwords.  The event header is
    another longword.  Since this event encompasses a range of activities,
    the application may want to know how long it was out.  If so, reserve
    an additional two longwords for either timestamps or full sequence
    numbers to indicate when events began to be lost and when event
    transmission (always beginning with this event) resumed.  We now
    have seven longwords of an eight-longword event packet used.  The
    remaining longword could be used for modifier and mouse button state at
    the time the event lost event is transmitted.  (NB:  This event is
    connection-wide:  it does not associate with any one resource.)
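
    For concreteness, such a packet might be declared as follows.  This is
    only a sketch of the layout just described; every name here is invented
    and nothing like it exists in x.h or the protocol today:

        typedef struct {
            unsigned char  type;          /* new "LostEvents" event code         */
            unsigned char  detail;        /* unused                              */
            unsigned short sequence;      /* low bits of the sequence number     */
            unsigned long  lostMask[4];   /* 128 bits: one per possible event    */
            unsigned long  firstLost;     /* time/sequence when loss began       */
            unsigned long  resumed;       /* time/sequence when delivery resumed */
            unsigned long  deviceState;   /* modifier and button state right now */
        } XLostEventsPacket;              /* eight longwords, 32 bytes           */

    Whether the middle fields carry timestamps or full sequence numbers is
    exactly the sort of detail that would have to be settled in the protocol.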
    
    What kind of response might a client want to take on receipt of such
    an event?  It could:
    
    	1.  Give up.  Currently, the server does this for it but now a
    	    client will have to do it itself.
    
    	2.  Update everything.  This means repaints, dropping active and
    	    passive grabs, fixing keyboards if they've been changed, etc.
    	    Consider all the resources that may be involved and this may be
    	    a very time-consuming task (as in the case of the toolkit).
    	    A review of the Xlib interface is needed to ensure that we
    	    *can* restore to a known state.
    
    	3.  Intelligent/Selective Update.  The application needs to perform
    	    query operations to see what has changed.  We may need new
    	    protocol requests to query the window layout (as visible not as
    	    defined by the application), GC's, colormaps, and other
    	    resources.  Extensions must provide equivalent functionality.
    	    Additional work is needed in the server up through every
    	    application.
    
    Comments?
    
    						Monty

60.62KONING::KONINGNI1D @FN42eqWed Feb 15 1989 14:5141
.63 says part of what I was in the process of replying to .62...

Re  .62: It's not that simple: there may be multiple clients using the same 
server. (In fact, there just about always are multiple clients.)  The property 
you MUST have is that one client's lack of progress does not block other
clients.  So you can't simply stop accepting keystrokes, or mouse motions,
or whatnot, since some of those inputs may be going to clients that are
operating correctly.  And of course some events are generated by the
actions of other clients: if client A deletes a window, client B may
receive an exposure event.  Clearly it would not be valid to prevent A from
deleting that window.  

If events have to be discarded, and the events are input (keystroke, mouse)
events, then a bell or some similar feedback may be a good idea.  But
with or without that, I believe an "events lost" indication is essential.
If an application wants to take a head-in-the-sand attitude it can simply
ignore such events, though this would tend to result in low quality
applications.  A full repaint (treating events lost as a full exposure event)
is probably the minimum that makes sense.  As .63 points out, restoring
ALL the state may take a lot of work.  There's probably a subset of the
state that could be restored efficiently; something on the order of what
is restored on deiconize.  (Then again, I may simply be showing off my
ignorance of the complexities of X here.)

As for the suggestion to provide some more detail on the events-lost
notification (e.g., classes of events that were lost): that might be useful,
though I suspect most applications wouldn't make use of that.  The fact
that anything at all was lost would be grounds for recovery actions; since
the application isn't supposed to be falling behind and losing things as a
normal operating mode, you wouldn't want to make those recovery actions
all that sophisticated.  There's a rule about "this shouldn't happen" type
of error recovery code -- it says that such code in fact doesn't work in
the field, since it's not tested during field test, certainly not in all
its permutations.  This argues for keeping the lost-event handling code
simple, since in most applications and most configurations it should be
rare.  (Another way to justify that it should be rare is that this event,
when it occurs, disrupts the user interface.  So a human factors argument
says that it must not occur often.)

	paul

60.63Getting back to the problem, if not the subjectPOOL::HALLYBThe smart money was on GoliathWed Feb 15 1989 16:5420
    Designing protocols can be fun, and you guys are doing such a great job
    that I don't need to make any contributions.  But I am worried about what
    appears to be the harder problem -- those long-running applications that
    don't want to change their code.  It seems to me that if you have a
    developer who's going to make use of "lost events", repaint the screen,
    clean up etc., then you probably have a developer who's going to write
    good enough code so that the problem doesn't arise in the first place.
    
    But what do we do about the application that goes into a black hole and
    ignores events for a long time?  Should we provide developers with real
    fast test-1-bit type instructions (if set, call the event queue processor)?
    Or should we provide some way (like ASTs, but not ASTs) to sort of force
    a reluctant application to process events?
    
    It should be OUR desire to make DECwindows such an attractive system that
    ISVs will want to use it.  Forcing X calls into application loops isn't
    the way to advance that cause.
    
      John

60.64PSW::WINALSKIPaul S. WinalskiWed Feb 15 1989 17:2439
RE: .65

I don't see where the case you're referring to is a problem.  Suppose we have
an application that does some DECwindows setup, then calls a subroutine that
does several days' worth of number crunching, ignoring the X event queue the
whole while.  Upon leaving that subroutine, it updates the DECwindows screen.

What happens now is that if the event queue fills up, the server drops the
connection and the application bombs once it leaves the subroutine.

If the "events lost" event is added, the application can find out that this
has happened, if it wishes to, and can take corrective action.  If the
programmer ignores the "events lost" event, then it's possible that there might
be misbehavior.  So what?  At least with "events lost" events, this sort of
application can recover from lost events.  With the current protocol design, it
cannot.  Note also that the check for events doesn't have to be in the number-
crunching subroutine this way--we haven't forced the programmer to turn his
application inside out.
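
As a rough sketch of that pattern (the LostEvents code below is the
hypothetical event being proposed here, not something Xlib provides, and the
helper routines are invented), the recovery check lives entirely outside the
number-crunching code:

    #include <X11/Xlib.h>

    #define LostEvents 35                   /* invented code for this sketch */

    extern void crunch_for_days(void);      /* knows nothing about X         */
    extern void repaint_everything(Display *dpy);

    void run(Display *dpy)
    {
        XEvent ev;

        crunch_for_days();                  /* ignores the event queue       */

        while (XPending(dpy) > 0) {         /* drain whatever piled up       */
            XNextEvent(dpy, &ev);
            if (ev.type == LostEvents)
                repaint_everything(dpy);    /* worst-case recovery           */
            /* handle Expose and friends as usual ... */
        }
    }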


Regarding getting this change accepted by the X Consortium--the "events lost"
event seems to be in keeping with the general X philosophy of pushing work back
on the application.  Just as an application must decide if exposure events are
significant and if so, process them, "events lost" events put the decision on
whether event buffer overflow is significant in the hands of the application,
not the server.  If the application chooses not to handle such events, it can
either ignore them or abort.  One would expect the DECwindows Toolkit to recover
from such events, of course.

A properly-designed server that is supposed to handle more than one client
simultaneously should never let itself get into the situation where a flow
control problem with one client blocks the entire server.  This can be done
without the server imposing any kind of timeouts on client connections.  Link
breakage detection and timing out of connections should be the job of the
underlying virtual circuit transport on which the server/client communication
is based--it should not be done by the server itself.

--PSW

60.65I think I can answer the question about why TCP is less affectedRIGGER::PETTENGILLmulpWed Feb 15 1989 20:3734
TCP (it doesn't need to be IP, but usually is) is byte stream oriented.  This
means that the application needs to provide its own record framing (which isn't
usually much of an issue) and `interrupt messages' are sort of a kludge (if
you don't have records, how do you know where to insert data that is supposed
to skip to the front of all other data without hopelessly confusing the
application).  However, X fits TCP well (for hopefully obvious reasons) since
it has its own `record structure' and it doesn't use interrupt messages.

So, how does this help ?

Being a byte stream protocol, TCP is geared to handling a byte stream.  Its
flow control unit is bytes, not records, and it gets to decide for the most
part when to transmit data based on either a timer or some fraction of its
buffer being filled.  When it transmits a datagram (usually IP but not required)
the datagram includes the starting byte in the current window and the number
of bytes in this segment.  On the receiving end, the message must ALWAYS be
processed, even if (some of) the data has been received (and passed to the user
and acknowledged) before.  This means that when a TCP connection has a byte
quota of 6000, the connection won't stall until all 6000 bytes of the buffer
are filled.  It is possible to write 1 byte at a time to the TCP socket and
without any ack from the other end, send 6000 datagrams ranging in size from
1-6000 bytes long (IP datagrams can be as large as 8kb).  TCP doesn't need to
keep around 6000 copies of the datagrams in actual or virtual format to operate.

As I understand the VMS DECnet implementation, the pipeline quota in bytes
is simply used to compute the number of outstanding datagrams that will be
used.  Something along the lines of 10000/576 -> 18 datagrams.  If a write
is done for 1 byte, then one datagram is used, and it is possible for 18 bytes
of data to consume the 10000 bytes of quota.  I'm being extreme, but in the
case of mouse events, I expect that no matter how fast a user is, each click
results in a very small amount of data (25 bytes) being written which will
be sent in a separate DECnet datagram.  (As I said, I don't understand this
well, or maybe not at all....)

60.66Oops, I missed the obvious on TCPRIGGER::PETTENGILLmulpWed Feb 15 1989 22:3236
I just did a little checking of what messages actually get sent and realized
that I missed the obvious about TCP.

The receiving end of TCP doesn't need to worry about keeping track of record
boundaries, so it can simply stuff everything in one buffer.  In the case
of the VMS Connection, it normally has a receive and transmit buffer size
of 4096.  After establishing a connection (DECW$CLOCK) and then making sure
that the client would not process any events (^Y) I generated events and
watched how the server system sent about 60 datagrams which filled the 4096
byte buffer on the client system (average of about 70 data bytes each).  Then
I watched as the 4096 byte buffer on the server filled.  About 30 seconds after
the server buffer filled, the server killed the connection to the client.
Each datagram was ack'd by the client system.

Until both buffers were filled, the server continued to function normally.

In contrast, when I did the same using DECnet/VAX, the server sent about 18
datagrams (each ack'd) averaging about the same size as the TCP datagrams
and then the server stalled.  About 30 seconds later the server terminated the
connection.  The DECnet system had a pipeline quota of 10000.
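
The arithmetic behind the difference, using roughly the figures observed
above (a toy calculation only, not part of any transport code):

    #include <stdio.h>

    int main(void)
    {
        int event_bytes = 70;       /* approximate data bytes per event here */
        int tcp_buffer  = 4096;     /* TCP charges its quota in bytes        */
        int quota = 10000, datagram = 576;  /* DECnet charges per datagram   */

        printf("events the TCP side buffered:    about %d\n",
               tcp_buffer / event_bytes);        /* ~58, close to the 60     */
        printf("events the DECnet side buffered: about %d\n",
               quota / datagram);                /* ~17, close to the 18     */
        return 0;
    }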

This suggests a partial solution; since DECnet won't make efficient use of
its buffers (i.e., using 1500 bytes to store 60-70 bytes of data), the DECnet
transport module needs to do it.  On the client side, it could do its input
I/O with ASTs and read into a buffer from which it passes data to the Xlib
code.  As long as it is able to get ASTs, it will be able to keep the server
happy until its buffer fills.  Similarly, on the server side, it needs to make
sure that it never stalls and when DECnet won't accept any more data, the
transport module needs to move the data into its own buffer.
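
In outline, the client side of such an interim hack might look something like
this (a schematic only: the names, the buffer size, and the overflow policy
are all invented, and the real thing would be driven by VMS ASTs and QIOs
rather than plain function calls):

    #include <stddef.h>

    #define STAGING_SIZE (64 * 1024)

    static unsigned char staging[STAGING_SIZE];
    static size_t head, tail;                  /* tail = oldest unread byte  */

    static size_t used(void) { return head - tail; }

    /* Called from the receive-completion AST: copy the data aside at once,
     * so the transport can acknowledge it and the server never stalls.     */
    int on_receive(const unsigned char *data, size_t len)
    {
        size_t i;
        if (used() + len > STAGING_SIZE)
            return -1;                         /* full: now a policy decision */
        for (i = 0; i < len; i++)
            staging[(head + i) % STAGING_SIZE] = data[i];
        head += len;
        return 0;
    }

    /* Called from the Xlib read path at the application's leisure. */
    size_t xlib_read(unsigned char *out, size_t want)
    {
        size_t i, n = used() < want ? used() : want;
        for (i = 0; i < n; i++)
            out[i] = staging[(tail + i) % STAGING_SIZE];
        tail += n;
        return n;
    }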

This is certainly a hack, but I believe that it would only need to be an
interim hack until the Phase V interface becomes available; I'm guessing,
but I suspect that it may help in this area.  If not, then this is the kind
of info Tom Harding et al were looking for a few months back when they were
asking what advantage a stream interface offered and should they support one.

60.67Some events are more equal than othersDSSDEV::TANNENBAUMWed Feb 15 1989 22:4529
    Even with the proposed changes, DECwindows would still be missing an
    important feature available in the terminal world.  Applications
    aren't always well behaved.  If my application runs away, I want (need)
    some way to get control of it without necessarily blowing away the
    process.  I may have invested a lot of time and effort in my current
    application state.  I want to save it if at all possible. 
    
    Even if the a "lost events" event is added, TPU will still need to poll
    the input queue periodically to check for ^C's.  It's too easy to put a
    TPU-based application into an infinite loop.  For example, type
    
    	TPU a := 0; LOOP a := a + 1; ENDLOOP
    
    at EVE's command prompt and watch TPU count to infinity.
    
    Our first attempt at dealing with this resulted in our asking XLIB
    for an AST for any keyboard character.  Performance was abysmal.
    Users type *lots* of keys at a text editor.  Currently we have an AST
    that checks the input queue once a second (XLIB can be called at
    AST level) and sets a flag if there are any events pending.  At
    the top of our interpreter loop, we check the flag and call a routine
    to dispatch any pending events if it is set.  (The tool kit can
    only be called from non-AST level)
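
    Roughly, the shape of that arrangement is the following.  This is only
    a sketch, with a POSIX alarm standing in for the VMS timer AST; none of
    these names are TPU's actual routines:

        #include <signal.h>
        #include <unistd.h>
        #include <X11/Xlib.h>

        static volatile sig_atomic_t check_events;

        static void tick(int sig)           /* stand-in for the timer AST     */
        {
            (void)sig;
            check_events = 1;               /* cheap: just note it's time     */
            alarm(1);                       /* re-arm for a second from now   */
        }

        void interpreter_loop(Display *dpy)
        {
            signal(SIGALRM, tick);
            alarm(1);

            for (;;) {
                /* ... execute one interpreter instruction ... */

                if (check_events) {         /* cheap test at the top of loop  */
                    check_events = 0;
                    while (XPending(dpy)) { /* dispatch at non-AST level only */
                        XEvent ev;
                        XNextEvent(dpy, &ev);
                        /* dispatch ev, look for ^C, etc. */
                    }
                }
            }
        }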
    
    Imagine trying to debug an application that goes into an infinite loop
    without being able to type ^Y DEBUG... 
    
    	- Barry

60.68Events lost event rejected by MIT in the pastSTAR::BMATTHEWSThu Feb 16 1989 05:245
An events lost event was proposed to the X11 developers during X11 development
and it was rejected so I am not sure how likely it is to get this into the
protocol.
						Bill

60.69X12R1STAR::BRANDENBERGIntelligence - just a good party trick?Thu Feb 16 1989 09:4028
    
    Getting such a change accepted by the Consortium is going to be a huge
    task.  This change is more than just adding a new event packet.  Here
    are some of the implications:
    
    1.  Throw out the event section of the protocol manual (which is both a
    protocol *and* a server specification).  In the future, event delivery
    becomes unreliable and so there will be no guarantee the count fields
    will be honored or bracketed state changes be undone (such as
    buttonRelease after buttonPress, ungrab after grab, etc.).  An
    "eventsLost" event will require an application to either give up or
    recover server state, much more involved than exposure handling.
    
    2.  Toolkits, Widgets, and any other programming or environment tool
    will have to handle this event gracefully if an application using it is
    to be reliable.  And not just DEC's toolkit:  Athena's, HP's, and every
    Tom, Dick, and Harry, Inc. that makes an X Window System.
    
    3.  Rewrite event handling in applications.  All applications. 
    Everybody's applications.
    
    Item 1 is a significant enough change to call this "X12."  The
    Consortium will have to be pushed *hard* (or off a cliff) to get this
    change accepted.  After all, how many programmers care about reliable
    systems?
    
    						m

60.70STAR::BRANDENBERGIntelligence - just a good party trick?Thu Feb 16 1989 09:4413
    
    Re .68 "partial solution":
    
    This is part of what common transport attempts to do.  There is
    obviously a tradeoff between ability to achieve a stream-like appearance
    and the CPU cost of performing the data copies.  I chose a point that
    leaned too far towards performance and not enough towards streams.  I
    currently have some transports running that perform more copying on
    writes and this has improved reliability.  The performance impact is
    not yet known.
    
    						m

60.71KONING::KONINGNI1D @FN42eqFri Feb 17 1989 11:1932
Stream transports can make the problem appear less quickly, but clearly can't
eliminate the problem.

Re .71: the impression I get is one I keep getting over and over from certain
areas: that high quality is a non-goal.  "Good enough for programmers" is
all that is considered necessary.  UGH.  I also don't think the arguments
hold water.  "Event delivery becomes unreliable."  Sure -- but it already 
IS unreliable.  In all the cases where it is reliable currently, it will
continue to be reliable.  In all cases where it is currently so unreliable
that it blows the application completely out of the water, it continues
to be unreliable.  The only difference is that the error is no longer a
fatal error but one that applications can, if they wish, recover from.
Currently, the error is fatal and applications are not given the option
to recover no matter how much they may want to.

There is no compatibility problem.  Any application that ignores the
event will not be any worse off than it is now.  Depending on what it
would have done had it not chosen to ignore the event, it may be very
much better off.  Any application whose developers take the trouble to
do some work to process the event is improved in the process.

In other words, you can't lose.  It is an absolute improvement for every
application.

Re .65: what to do about applications that don't want to redesign their
code to guarantee that events are processed quickly enough to avoid event
loss: that's where a proper multithread support will help.  Put the
application in one thread, the event handler in another, and you're done.
(Well, close, anyway...)

	paul

60.72You can't poll for events often enough, everPRNSYS::LOMICKAJJeff LomickaFri Feb 17 1989 13:2615
After what happened to me yesterday, I am convinced that the current X
transports cannot be made reliable on VMS unless you check for X events
between EVERY INSTRUCTION, perhaps more often than that.

You see, yesterday a machine running a client of my workstation went
into a long cluster transition.  Need I say more?  I will anyway.

I beat on the keyboard and mouse a bit, and sure enough, my entire
STAND-ALONE workstation was hung until the server decided to trash the
offending client; then I could proceed.

My gut reaction to this entire discussion is "how could anybody be so
ignorant as to ignore the flow control problem here".


60.73STAR::BRANDENBERGIntelligence - just a good party trick?Fri Feb 17 1989 13:4518
    re .74:
    
>My gut reaction to this entire discussion is "how could anybody be so
>ignorant as to ignore the flow control problem here".

    Say one of the following in a whining, geekish voice:
    
    1. "It's too haaaaaard to solve."
    
    2. "*I* don't have any problems;  there must be something wrong with
    the user or programmer!"
    
    3. "What flow control problem?"
    
    4. "Zzzzzzzzzzzz.  Snort."
    
    					monty

60.74VWSENG::KLEINSORGEToys 'R' UsFri Feb 17 1989 13:586
    
    As one of the x11-high-and-mighty started a mail message to me two
    years ago:  "Any competent programmer..."
    
    

60.75it can be done compatiblyPSW::WINALSKIPaul S. WinalskiSat Feb 18 1989 15:1951
Sorry, but the line of reasoning in .71 is faulty.  Taking the points in order:

1) Addition of an events lost packet does not mean that event delivery becomes
   unreliable.  As Paul Koning pointed out, event delivery already IS
   unreliable.  Events lost continues to be an error condition, as it is today.
   The only difference is that an events lost packet lets the client decide
   if the condition is severe enough to warrant aborting the connection.  Today
   the server decides unilaterally that the condition is always fatal.
   Applications that cannot deal with the condition for any of the reasons
   that you cite are perfectly free to handle an events lost condition by
   aborting the connection.  The difference here is that those clients that CAN
   handle the condition are able to do so.

2) Toolkits *should* handle the condition gracefully to be of maximum service
   to the user.  Those that choose not to handle the condition gracefully will
   offer service that is exactly like it is today.

3) The change can be done compatibly, with no rewrite required in existing
   applications.  The way to do this is to make enabling events lost
   notification optional.  This could be done in either of two ways:

   o add a routine, call it XSetFlowControl().  This would be analogous to
     XSynchronize().  When you enable flow control, the server will try to send
     events lost packets instead of aborting the link when buffering capability
     is exhausted.  If flow control is not enabled explicitly, you get the
     current behavior.

   o enabling delivery of events lost notifications via XSelectEvent causes
     the server to send such events instead of aborting the link when buffer
     space is exhausted.

   Either of these methods would be completely upward compatible with current
   behavior since an application must explicitly ask for events lost packets
   to be sent, otherwise you get today's behavior.
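
From the client's point of view, opting in might look roughly like this
(XSetFlowControl and the LostEvents event code are both names invented for
this proposal; neither exists in Xlib or the protocol today):

    #include <X11/Xlib.h>

    #define LostEvents 35                           /* invented event code  */
    extern void XSetFlowControl(Display *dpy, int onoff);  /* proposed call */
    extern void recover_from_gap(Display *dpy);

    void event_loop(Display *dpy)
    {
        XEvent ev;

        XSetFlowControl(dpy, 1);   /* "send LostEvents rather than hang up" */

        for (;;) {
            XNextEvent(dpy, &ev);
            if (ev.type == LostEvents) {
                recover_from_gap(dpy);     /* repaint, re-query state, etc. */
                continue;
            }
            /* normal dispatching ... */
        }
    }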

It should still be a guiding principle of the design of X transports that they
do whatever possible to avoid getting into the situation where either a packet
must be dropped on the floor or the server/client connection aborted.  However,
it is a fundamental fact of life in protocols of this sort that data loss due
to buffering capacity being exceeded can and will occur.  The trick is to find a
reasonable way to handle the situation.  X's current method--terminate the
client link--is Draconian but effective.  Allowing an application to receive
an events lost event, if the application so chooses, puts the decision whether
to abort the link in the hands of the client rather than the server.  This
seems to me in perfect keeping with the general X philosophy of not having
the server make policy decisions that are better made elsewhere.  Who knows
better than the client itself whether the situation is recoverable at the
client end?

--PSW

60.76devil's advocacySSAG::GARDNERSun Feb 19 1989 19:5623
    I know absolutely nothing about the X protocol per se, so maybe this is
    off the wall.  But it's an obvious enough question that it needs to be
    asked.
    
    Why can't termination of the connection be treated as an "events lost"
    notification?  When a connection is broken, why doesn't the application
    and/or toolkit try to re-create the connection and, if it succeeds, do
    whatever it was going to do to recover from "events lost".  When events
    are lost, the state of the application's windows, etc. are essentially
    indeterminate; it should probably re-construct or refresh them from
    scratch (a previous response suggested doing this regardless of any
    shortcuts that might be possible).  Why can't it just re-create them on
    a new connection?
    
    If such an approach is plausible, it has the advantage of being totally
    compatible with the current X protocol.  Plus it avoids a potential
    pitfall of adding "events lost" notifications.  Suppose an application
    crashes somehow without explicitly tearing down the connection.  My
    impression is that there's no convenient way for me, from the server,
    to abort the connection and recover the server resources that are
    devoted to it.  The server might merrily preserve the connection
    forever, discarding events as necessary.  

60.77PSW::WINALSKIPaul S. WinalskiMon Feb 20 1989 01:176
You can't just reestablish the connection because all of the windows, graphics
context, etc. associated with the connection are destroyed by the server when
the connection goes away.

--PSW

60.78Why not make it a Extension ?LESZEK::NEIDECKERDont force it,get a bigger hammerMon Feb 20 1989 01:5711
Re. 70-71:

	If it is so hard to get this additional event accepted by the
	consortium, why don't we make it into an extension package that
	DECwindows servers support ?  If it turns out to be the solution, we
	have a bonus; if a server doesn't know the extension, our clients
	(the Toolkit, etc.) fall back to whatever they do today (e.g. nothing).
	Should be little registration hassle ?
			
					Burkhard Neidecker-Lutz, Project NESTOR

60.79SSAG::GARDNERMon Feb 20 1989 12:4711
> You can't just reestablish the connection because all of the windows, graphics
> context, etc. associated with the connection is destroyed by the server when
> the connection goes away.
    
    But doesn't the toolkit/application have a representation of that
    information in the various toolkit data structures?  Since unknown
    events have been lost, you have to walk these data structures anyway to
    restore the windows, graphics context, etc. on the screen.  To this
    (possibly naive) observer, it doesn't seem significantly harder to
    re-create the objects first.

60.80PSW::WINALSKIPaul S. WinalskiMon Feb 20 1989 15:4410
If you lose events, the server still has windows and graphics context for the
application.  It's just that they may not be quite in the state that the
application thinks they are in.  If you break the connection, the server throws
away the windows and graphics context completely.  If the connection goes away
and then the application establishes a new one and restores things, the user
will see the application's windows actually disappear from the screen and then
come back again.

--PSW

60.81If you can't solve the problem, avoid it ?STAR::MANNMon Feb 20 1989 20:0217
	If the server detects that the user is trying to select a
	stalled session, why not just display a skull and crossbones ?

	This method:

	1 - Gives the user appropriate feedback
	2 - Prevents the server from entering a (temporarily) deadlocked state
	3 - Prevents the application from being needlessly aborted
	4 - Does not involve any X protocol changes

	Ever notice the terminal driver lock your keyboard ? Guess
	what it would have to do with that character if it let you 
	type it ?

	Or is the X server code unmodifiable in this manner ?
								Bruce

60.82PSW::WINALSKIPaul S. WinalskiMon Feb 20 1989 21:378
It's more complicated than selecting a stalled session.  Suppose you push a
window.  That could cause a string of exposure events, some of which can be
sent and others of which can't because the buffer space was exhausted.  It's
hard to tell before the operation occurs that it could cause somebody to
overflow buffer space.

--PSW

60.83complicated, I believeSTAR::BRANDENBERGIntelligence - just a good party trick?Tue Feb 21 1989 12:4052
    
    Sorry, but the line of reasoning in .77 is faulty.  Consider the
    following extract from page 76 of the X11, Release 2 Protocol Document:
    
    	For a given "action" causing exposure events, the set of events
    	for a given window are guaranteed to be reported contiguously.  
    	If count is zero, then no more Expose events for this window
    	follow.  If count is non-zero, then at least that many more 
    	Expose events for this window follow (and possibly more).
    
    Implications of adding a "LostEvents" event:
    
    1.  Protocol and semantics change.  Count is more like a "hint" than
    a reliable value.
    
    2.  Application programs change.  Certain coding constructs are no
    longer acceptable.  For example, an event handling routine may switch
    on the eventType in an event packet to execute code such as:
    
    switch (ev.type) {
    case Expose:
    	/*
    	 * dump the extra expose events promised by the count field
    	 */
    	for (i = 0; i < ev.xexpose.count; i++) XNextEvent(dpy, &dummyEv);
    	/*
    	 * do generic exposure handling
    	 */
    	do_exposure();
    	break;
    }
    
    This code is correct under the current protocol and server semantics
    but is incorrect after the suggested protocol change is made (a version
    reworked to tolerate the change follows this list).
    
    3.  Scope of "LostEvents" Event burdens *all* applications.  For
    reasons of implementability in the server, this event should probably
    be associated with a connection and not one-or-more-per-resource as are
    many events.  (In this respect, it is much like the unmaskable events.)
    So, if an application were to turn on this event, any toolkit it used
    would also see the event.  Or, if any toolkit wanted to receive this
    event and turned it on, it would be turning it on for the application.
    Is everyone prepared to handle this event even if only to ignore it?
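
    For contrast, here is one hypothetical reworking of the fragment in
    point 2 that would survive the change:  the count field is treated only
    as a hint, and the invented LostEvents code is handled explicitly.
    None of this is existing Xlib or server behavior.

        #include <X11/Xlib.h>

        #define LostEvents 35                      /* invented for the sketch   */

        extern void do_exposure(void);

        void handle_event(Display *dpy, XEvent ev)
        {
            switch (ev.type) {
            case Expose:
                /*
                 * Skip redundant Expose events, but stop at the first sign
                 * of a gap -- the promised count may never arrive.
                 */
                while (ev.xexpose.count > 0 && XPending(dpy)) {
                    XEvent next;
                    XPeekEvent(dpy, &next);
                    if (next.type != Expose)
                        break;
                    XNextEvent(dpy, &ev);
                }
                do_exposure();
                break;
            case LostEvents:
                do_exposure();                     /* assume everything changed */
                break;
            }
        }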
    
    The addition of a "LostEvents" event is necessary.  An xlib request
    similar to Paul's XSetFlowControl (or a point revision to the protocol
    version sensed at connection setup time) may be desirable.  But these
    two alone are not sufficient to make X reliable.  Event processing and
    generation *does* change. And, do we have the functionality that allows
    an application to recover relatively conveniently from such an event?
    
    					monty

60.84Some additional thoughts...IO::MCCARTNEYJames T. McCartney III - DTN 381-2244 ZK02-2/N24Tue Feb 21 1989 14:1016
RE: .83

Whether the terminal driver locks the keyboard or simply throws away any
character for which it does not have buffer space, the results are identical.
In either case the datastream being sent to the host is interrupted and data
is lost. The keyboard being locked does not physically prevent data from being
transmitted on the line, nor does it stop an operator from continuing to strike
keys. Although I agree with the behaviour of the terminal driver, it is not an
adequate model for solving the problem inherent in X. The terminal driver does 
not provide the needed "data lost" indication.

I like the idea of a skull and crossbones, especially if it was imaged inside 
of a solid black locator cursor.

James

60.85Ok, how about this ?STAR::MANNTue Feb 21 1989 20:5120
>It's more complicated than selecting a stalled session.  Suppose you push a
>window.  That could cause a string of exposure events, some of which can be
>sent and others of which can't because the buffer space was exhausted.  It's
>hard to tell before the operation occurs that it could cause somebody to
>overflow buffer space.

	When a session stalls, immediately shrink it to an icon (automatically) 
and queue the event/message to it which caused the stall in an "overflow" 
buffer(s) (and display a skull and crossbones). Now it cannot become the 
recipient of exposure events, can it ? If the session unjams, send the 
overflow buffer(s) and resume normal operation.

	"buffer space exhausted" is a policy, right ? The workstation has not
run out of memory ! Transport is simply advising it that it is no longer
sensible to send messages because they cannot be delivered just now. Just
reflect this condition back to the user in a way that prevents the user from
continuing that session in a non-discretionary manner (make the user use his
memory).
								Bruce

60.86can the server do that?AITG::DERAMODaniel V. {AITG,ZFC}:: D&#039;EramoTue Feb 21 1989 23:199
     re .87
     
>>     When a session stalls, immediately shrink it to an icon (automatically)
     
     Isn't it the window manager (i.e., another client) and not the
     server that knows about things like icons and where they go?
     
     Dan

60.87Some sleight-of-hand, a little smoke and mirrors..POOL::HALLYBThe smart money was on GoliathWed Feb 22 1989 09:577
>     Isn't it the window manager (i.e., another client) and not the
>     server that knows about things like icons and where they go?
    
    Maybe the server could send a message to the wm saying "put this guy in
    the drunk tank".  Come to think of it, the icon box icon looks a bit
    like a jailhouse window...

60.88VWSENG::KLEINSORGEToys &#039;R&#039; UsWed Feb 22 1989 10:0964
    Let's look at what the terminal driver (VMS), terminal (VT200) and
    a random application do...
    
    	Terminal data comes in and the terminal class driver puts
    	the character data into a typeahead buffer and completes
    	a outstanding read if the conditions of the read are met.
    	If the typeahead buffer contents reaches a certain degree
    	of 'full', the class driver tells the terminal to shut-up
    	(XOFF).  It will still accept data until the typeahead is
    	full at which point it drops any further data and returns
    	a DATAOVERRUN when the typeahead is finally read.
    
    	When the terminal gets the XOFF, it *also* (at least on VT200's
    	and VT300's) has some amount of buffering and will buffer
    	transmit data until *its* buffer is full at which time it
    	sets the WAIT LED and drops transmit data.
    
    	The application is oblivious to all this.  Periodically it
    	reads the typeahead buffer and only knows about any of this
    	when and if it gets a DATAOVERRUN message.
    
    Extend this to the X11 world:
    
    	First, this implies that the client software which manages the
    	connection gets asynchronous notification of an event and moves
    	the packet from the transport to the clients event queue (i.e.
    	this operation is not a side effect of a processing loop in
    	user mode!).  This software sends a message to the server when
    	the clients event queue reaches some degree of 'full' telling
    	the server to shut-up (XOFF).  It of course still accepts new
    	events until it runs out of free packets at which time it
    	starts dropping events and begins to build a event-lost client
    	structure.
    
    	The server, sends event packets off as long as it hasn't been
    	told to 'shut-up'.  By a combination of local buffering on a
    	per-connection basis by the server and the 'slop' in the client
    	side event queue after a XOFF, the event "counts" should always
    	remain valid, that is, even if the server is XOFFED after having
    	sent the packet with the "count", the combination of the client
    	buffering to soak up packets already "in the pipe" and the server
    	buffering of unsent packets would deliver all the packets promised
    	(so it doesn't need to change the meaning of the count to a
    	'hint').  If and when the server runs out of buffering, it starts
    	building a server-side event-lost structure that will be used
    	to build an event-lost event when the client starts taking input
    	again.  This implies that the server is smart enough not to
    	send a counted event during an XOFF if there is not enough local
    	buffering available for all the event packets.
    
	When the server reaches the point that it is discarding events,
    	it *can* give some visual 'hints' that it is stalled, including
    	ringing the BELL for KB input, turning off autorepeat on the
    	KB, setting the WAIT LED on the KB, and changing the cursor shape.
    	All of these can be restored to a proper state once
    	the error condition is corrected.
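
    	The client-side half of this amounts to watermark throttling of
    	the event queue.  Schematically (every name, threshold, and the
    	XOFF/XON messages themselves are invented for illustration), with
    	the server needing the complementary logic plus the per-connection
    	buffering described above:

        #define QUEUE_LIMIT 256
        #define HIGH_WATER  192         /* tell the server to shut up     */
        #define LOW_WATER    64         /* tell it to resume              */

        static int queued;              /* events sitting in the queue    */
        static int xoff_sent;

        extern void send_xoff_to_server(void);
        extern void send_xon_to_server(void);
        extern int  enqueue_event(const void *packet);  /* 0 ok, -1 full  */
        extern int  dequeue_event(void *packet);        /* 1 got, 0 empty */

        /* Runs asynchronously when transport delivers an event packet.   */
        void on_event_arrival(const void *packet)
        {
            if (queued >= QUEUE_LIMIT || enqueue_event(packet) != 0) {
                /* out of packets: start building the event-lost record   */
                return;
            }
            queued++;
            if (queued >= HIGH_WATER && !xoff_sent) {
                send_xoff_to_server();
                xoff_sent = 1;
            }
        }

        /* Runs in the application's normal event-reading path.           */
        int next_event(void *packet)
        {
            if (!dequeue_event(packet))
                return 0;
            queued--;
            if (xoff_sent && queued <= LOW_WATER) {
                send_xon_to_server();
                xoff_sent = 0;
            }
            return 1;
        }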

    	Now, all of this is probably meaningless, because I've got this
    	nasty feeling that the client-side input queue is built as part
    	of the polling loop, and otherwise the data just stacks up
    	uncollected in the transport buffers (DECnet, whatever).
    
    

60.89I ruminate?VINO::WITHROWRobert WithrowWed Feb 22 1989 13:3157
I'm not an X windows maven, so don't yell at me.  I'd like to categorize the
later portions of this note (which seems to have migrated somewhat from
the original topic).  I will only be speaking in ``broad conceptual'' terms.

It seems to me that there are two concerns:  1) What should happen when a client
is sourcing events faster than the server is sinking them, and 2) what should
happen when the server is sourcing events faster than a client is sinking
them.

In case (1) it seems that most participants think it is fine if the
client is forced into quiescence (forced to nap) until the server has
caught up with it.  Seems reasonable to me.  Nothing is lost and other
clients are not affected.

In case (2), one can not force the server to nap because that will affect
other clients.  A previous reply suggested a skull and crossbones cursor,
etc.  Others objected that information is getting lost.  Comparisons with
terminal handlers were made, etc...  Can we take this in parts?

a) Does everyone agree that (2) is a ``policy'' issue?  I mean, it's nice
to claim that a client should always be able to sink events at least
as fast as the server can ever source them, but I don't think that is
possible since one can never have infinite buffering.  Lacking infinite
buffering one must have flow control, and that seems to me to mean
``flow control policy''.

b) It seems that flow can be controlled in several places and in many
different ways.  Suggestions have been: Implement flow control in
X protocol, possibly by throwing excess events into the can and telling
the client we did that; Implement flow control in a lower layer, and if
(2) happens take drastic action (which seems to be what is done now);
Implement flow control in the server by refusing to send events until
the client catches up.  Are there more?

Since (I hope we agree) this is a policy issue, I guess I would like to
see it resolved in the server, since I feel that it is rude of the server
to bombard the client with events, and, in the interrest of robustness,
I would prefer to assume that the server is smarter than the typical
client (and thus should be able to restrain itself).  Also, it is a
single point solution that does not require every
single client to worry about what to do with a rude server (servant?).

To that end, it seems reasonable to handle (2) this way:  When the
server discovers that it is sourcing events faster than a client is
sinking them it should: a) Ignore all user input into the window(s) associated
with the client (Perhaps it should beep for keypad input, and should
turn off the mouse pointer when it enters the window), and b) not send
exposure events to the client.  If the server does save-unders it would
be free to repaint exposed areas itself from its backing store, otherwise
it should just leave the ugly holes in the window.

Later, when the client catches up, the server should again allow user input
in the windows, and (if it wanted to send any exposure events but couldn't)
send an exposure event for the entire window.

Like I said, Dont yell at me!!!!!   ;-)

60.90I rusticateSTAR::BRANDENBERGIntelligence - just a good party trick?Wed Feb 22 1989 15:13177
    
    Re .91:  I'll talk some more...
    
    I agree that this is *at least* a matter of policy but may also be a
    matter of protocol and server specification.  (My earlier reference to
    the interpretation and generation of expose events is sufficient to
    make the latter true.)
    
    I also accept the policy on case (1) where the client can't send to the
    server.
    
    Now, as for case (2), you've summarized the possibilites as being:
    
    a.  Drop events and generate a "LostEvents" event when possible.
    
    b.  Drop the connection.
    
    c.  Drop events but don't give any indication to the application
    	(there may be user/device feedback, however).
    
    If reliability, at least as I understand the term, is a goal, then b.
    is clearly unacceptable.  If either a. or c. is chosen, the protocol
    and server specifications still must change (see previous discussion on
    the interpretation and generation of expose events).  Furthermore, I
    believe that c. is *extremely* unfriendly to the application.  It
    doesn't find out that it has lost events until it either receives
    information on the server state that is inconsistent with its model of
    the server state or the user tells the application, via a "fix-up"
    request, that it is confused.  Consider a window manager in
    window-resize mode:  it has grabbed the server, it's receiving mouse
    motion events to perform stretchy-box operations but the server drops
    some number of mouse motion events *and* the upclick of the mouse.  At
    what point does the window manager find out that information was lost
    so it can ungrab the server and return to a safe state?
    
    Now, I'll jump into a policy definition for all data sent from the
    server to the client.  Keep in mind the following things:
    
    	1.  Any client can send any event to another client with
    	    XSendEvent().
    
    	2.  Clients interact with other clients as a natural part of
    	    operation.  One client's requests may result in any number of
    	    events being generated for any or all of the other clients.
    
        3.  Extensions.  Always remember extensions.  We don't know
    	    what they'll look like or how they'll define their own
    	    events, if they do at all, and use whatever policy we
    	    establish.
    
    	4.  Certain events/state transitions currently guarantee that
    	    certain other events will be sent at a later time.  If event
    	    delivery becomes unreliable without disconnecting a link,
    	    these "guaranteed" events may not be received by the client.
    
    
    The following "#define"'s are taken from x.h.  They represent the
    *currently* defined event codes.  I've also included replies (type
    code '1') and errors (type code '0').
    
    #define X_Reply		1
    
    Reply to request issued by client.  Unlike events and errors which are
    always 32 bytes, this may range in size from 32 to 2^34 bytes.  There
    is some indication that the client will try to read data but should the
    server wait unconditionally for a slow or hung or thrashing or
    malicious client?  I suggest a configurable parameter specifying a
    timeout for reply operations, probably on the order of 5-10 seconds. 
    If the client doesn't respond, disconnect.
    
    #define X_Error		0
    
    Some request generated an error.  Errors generated by asynchronous
    requests are asynchronous, those generated by synchronous requests
    (i.e. those expecting replies) are synchronous and the event is sent in
    place of the reply.  If the latter case, errors should be treated as
    replies and the timeout should be used.  If the former case, they could
    be treated as either replies or as events (they may be dropped).
    
#define KeyPress		2
#define KeyRelease		3
#define ButtonPress		4
#define ButtonRelease		5
#define MotionNotify		6
    
    Indicates that a keyboard key was pressed or released, a mouse button
    was pressed or released or that an "interesting" motion of the mouse
    occurred.  With unreliable delivery, release and press events may not
    match up.  If the Button?Motion masks had been used in requesting mouse
    motion events, a stream of mouse motion data may suddenly stop without
    any indication that a button had been released.  Etc.
    
#define EnterNotify		7
#define LeaveNotify		8
    
    Indication of mouse travel through the window hierarchy.  With
    unreliable delivery, any part of the traversal may be dropped so that
    there will be no indication that the mouse passed out of, into, or
    through some number of windows.  This may confuse some applications.
    
#define FocusIn			9
#define FocusOut		10
    
    Indication of change of input focus to some windows.  Also traverses
    hierarchy much like enter/leave notify so same caveats apply.
    
#define KeymapNotify		11
    
    Report of state of keymap.  Currently, when requested, it is sent after
    every enternotify and focusin event and a client can rely on this. 
    With unreliable delivery, this event may be lost or the preceeding
    focusin and enternotify may be lost thus creating an unexpected event.
    
#define Expose			12
#define GraphicsExpose		13
#define NoExpose		14
    
    Previously discussed.  Has a "reliable" count down field for contiguous
    events.  This no longer works with unreliable delivery.
    
#define VisibilityNotify	15
    
    Sent to a client after hierarchy change operations.  If lost, client
    may not know that a part of the display is now visible.
    
#define CreateNotify		16
#define DestroyNotify		17
#define UnmapNotify		18
#define MapNotify		19
#define MapRequest		20
#define ReparentNotify		21
#define ConfigureNotify		22
#define ConfigureRequest	23
#define GravityNotify		24
#define ResizeRequest		25
#define CirculateNotify		26
#define CirculateRequest	27
#define PropertyNotify		28
    
    *IMPORTANT* See the protocol specification.  Used by window managers to
    intercept application requests for hierarchy changes, etc.  If these
    are lost, the window manager will *REALLY* be confused.  How are these
    recovered?
    
#define SelectionClear		29
#define SelectionRequest	30
#define SelectionNotify		31
    
    Selection events.  Loss of these may mean that several clients think
    that they own a selection, or may cause other problems.
    
#define ColormapNotify		32
    
    Notification that a colormap has been changed.  Window managers and
    clients are interested in this.  Loss of this event *will* prevent
    colormap install oscillations. hahahahaha.
    
#define ClientMessage		33
    
    Generic information from one client to another.  Also used to "wakeup"
    toolkit from AST level.  Since this information cannot be recovered by
    a request, who should receive an error if this can't be sent?  The
    recipient or the sender?  Should this be made to execute like replies?
    
#define MappingNotify		34
    
    Report that a modifier, keyboard, or pointer mapping request was
    executed.  Loss of event means that a client may use the wrong mapping
    when it again receives input events.
    
    
    There is more to flow control than just dropping data and repainting 
    windows later.  THIS IS A BIG PROBLEM.
    
    						monty


60.91PSW::WINALSKIPaul S. WinalskiThu Feb 23 1989 16:5322
RE: .92

I agree, it's a big problem.  It's far too big a problem for the server to
arbitrarily decide for a client whether or not the situation is recoverable.
If a client receives a LostEvents event, it knows which events it had enabled
reporting for, and therefore what recovery actions have to be taken (if indeed
any are possible).

Receipt of a LostEvents event is an error condition.  Any client is well within
its rights to treat receipt of this event as unrecoverable and abort the link.
For example, the window manager probably would abort upon receipt of
LostEvents, since the event that was lost might be CreateNotify, DestroyNotify,
or one of the other events that you cited.  On the other hand, I have written
several applications that listen only to a small number of events and don't
really care if they miss one or more of them--if they are told that events were
lost, they can query the server as to the present situation, or for some events
(exposure, for example) they can assume the worst case and do recovery.  Why
should these sorts of applications get terminated unconditionally by the
server?

--PSW

60.92Just say WAITCVG::PETTENGILLmulpThu Feb 23 1989 20:5328
One solution would be to have the server clear the screen and display a big
WAIT whenever it became blocked trying to send to a client.  However, that
might lead to a deadlock, or at least a situation where the user must wait a
long time for things to free up, so the server would need to watch for multiple
^Y's so that it could ask `Are you pounding on ^Y to abort the client?'

Seriously, I'm mostly kidding above.  But now I'm not.

No scheme can prevent data arriving faster than it can be sent out and with a
user involved, you can't `flow control' a user so you are always going to be
faced with the possibility of data overrun.  Therefore, it will be necessary
to discard data one way or another and somehow notify the only thing that can
deal with the problem in an intelligent fashion, the user.  Currently this is
done by waiting for a while and then killing the connection and discarding
all the related data (and probably discarding some or all the data that the
user supplied while waiting) and then when the application clean up is done
the user is notified by the absence of his application and possible gets an
error message.

The proposal to send a `lost event' event is a compatible extension.  If the
application can't deal with the problem at all, or only sometimes, then the behavior
is no different than today.  If, on the other hand, it can recover, then it is
a big improvement.  Note that the WM can recover totally, although the user
might notice the recovery.  If you don't believe me, just stop the process and
then run it again.  Everything will return to the way it was.  Maybe it's not
the best that one could ask for, but it is better than not allowing the user to
continue at all.

60.93Window manager can't really recover...DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Fri Feb 24 1989 12:2815
Just a nit about .94:  The window manager can't recover completely from the
situation that PSW mentioned:  losing a MapRequest.  In this case, the
client which issued the Map will just sit around forever thinking that it
got mapped, but not really being mapped.  When the window manager makes its
"recovery" scan to figure out which windows to work on, it will never deal with
the "hanging" window, because it will assume that the client has not requested
that it be mapped yet.

However, having the window manager abort in this case does not help either.

This is a good example of the dilemmas faced when you try to break the
"reliable byte stream" assumption, though.

Burns

60.94PSW::WINALSKIPaul S. WinalskiFri Feb 24 1989 14:0711
The point is that we DON'T have a reliable byte stream today.  Should the window
manager or any other client get behind in processing events for any of a number
of reasons, the server will abort the connection and discard the queued
events.  The only thing that a LostEvents event does is let the client decide
whether to abort the connection instead of the server.  If the LostEvents
feature is left disabled by default and enabled by an explicit XSetFlowControl()
type call from the client, then the change is completely upward compatible.

--PSW


60.95KONING::KONINGNI1D @FN42eqMon Feb 27 1989 15:4426
Not only do you not have a reliable bytestream now, you never did, and you
never will.  Incidentally, the comment in .74 about "...transports...cannot
be made reliable on VMS" misses a key point: this whole discussion has
NOTHING to do with VMS, it has to do with fundamental and, I would have
thought, well known properties of distributed systems no matter what OS
they are built on.

I can see from the analysis in .92 that, for some applications, recovery
from EventsLost is harder than just repainting the windows.  For many it
will not be, though.  And clearly every one of them always has the option
to declare these to be fatal errors, in which case the situation is no worse
than it is now -- other than being that way by design rather than by
omission.

Something to consider: currently applications die when this problem occurs
because the connection terminates (and they don't do Ed Gardner's "devil's
recovery").  If one were to change X by simply adding this event, without
adding the enabling stuff that PSW proposed, then many applications would
still abort since they don't recognize the event.  Is that incompatible?

Then again, I probably can't get away with bitching about the X definition
of "reliable transport" and at the same time proposing this definition of
"compatible".   :-)

	paul

60.96More Fat for the FireSTAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 27 1989 16:2232
    
    Here are some things to ponder (not directed at any reply in
    particular).
    
    o  If a "reliable server" is a requirement for some users in some
    applications, then do we need to provide a mechanism by which this user
    can enforce this policy?  The policy might state that a client either
    accepts an eventsLost event or takes a quick disconnect on transport
    jam.  Or, it might state that *only* eventsLost clients are accepted.
    If the latter, then sensing the type of client should happen at an
    early stage of a connection, say, when the client transmits its
    protocol level.
    
    o  I've argued that eventsLost processing is basically connection-wide
    and not associated with individual resources.  Hence, one part of an
    application will unilaterally decide for the entire application what mode
    it will run in.  This isn't a problem for code that is newly written or
    that will be reworked for a new version of DECwindows.  But we already have
    a V1.0 product and, I assume, some sort of support commitment.  If so,
    it would be incorrect for a new toolkit library to enable lostEvents
    for an old application or for a new, reliable application to rely on an
    old toolkit library.  How does the left hand keep up with what the
    right hand is doing?
    
    o  The default event processing is not upward compatible with the new
    lostEvents event.  Most event processing will simply consume any
    unrecognized events because there already exist events which cannot be
    masked by XSelectInput so an application must expect the unexpected (to
    some degree).
    
    						monty
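
    For reference, this is the dispatch pattern the last point refers to --
    nothing hypothetical in it, just the usual Xlib loop.  An event type the
    switch doesn't know about falls through the default case and is quietly
    dropped, and clients already have to cope with nonmaskable events such as
    MappingNotify.

        #include <X11/Xlib.h>

        void event_loop(Display *dpy)
        {
            XEvent ev;

            for (;;) {
                XNextEvent(dpy, &ev);
                switch (ev.type) {
                case Expose:
                    /* ... redraw the damaged area ... */
                    break;
                case MappingNotify:                    /* nonmaskable */
                    XRefreshKeyboardMapping(&ev.xmapping);
                    break;
                default:
                    break;                     /* unrecognized: dropped */
                }
            }
        }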

60.97KONING::KONINGNI1D @FN42eqMon Feb 27 1989 17:0629
I might want to have a policy that all the applications I use must support
EventsLost.  But I don't see a way to enforce that in the way you
describe; that merely says that the application does something, but what
it does isn't necessarily sane.  Essentially this requirement is one of
those of the form "The application must have high quality".  This sort of
requirement has the Felix Frankfurter property "I know it when I see it".

My conclusion:
a. EventsLost should be added in the next version of DECwindows.
b. Handling of that event should be added in the next (or current, depending
   on planned release date) version of every DEC product, and in particular to
   all widgets.
c. Enabling of the sending of EventsLost (per PSW) is the job of the main
   program (via an Xmumble or XtMumble call).  Widgets don't do this.
   Our applications do, of course, as soon as they have been fixed to handle
   the event.

My guess would be that most of the work is in the widgets; the changes to
support the new event would be minor for most applications (though not for
all, obviously).  So by doing the 3 steps I mention, we create the message:

1. Our applications are now more robust than before (and indeed more so than
   any others in the industry).
2. Your applications can be, too, with -- usually -- a small amount of effort.
   Just use the new widget library, which, of course, is upward compatible,
   work out the recovery you need, and issue the enabling call.

	paul
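
A sketch of the division of labor in (b) and (c).  XtAddEventHandler with a
zero mask and nonmaskable=True is the real Xt entry point for events that
XSelectInput can't select; the enabling call is still the hypothetical
"Xmumble".

    #include <X11/Intrinsic.h>

    /* Widget side: recovery belongs to the widget -- re-fetch whatever
     * state it mirrors and redisplay its own window.                   */
    static void lost_events_handler(Widget w, XtPointer client_data,
                                    XEvent *ev, Boolean *continue_dispatch)
    {
        /* ... widget-specific recovery ... */
    }

    /* Application side: done once by the main program, not by widgets. */
    void enable_lost_events(Widget toplevel)
    {
        XtAddEventHandler(toplevel, (EventMask)0, True,
                          lost_events_handler, NULL);

        /* The hypothetical opt-in call from .94/.97:
         *
         *     XSetFlowControl(XtDisplay(toplevel), FlowControlNotifyOnLoss);
         */
    }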

60.98Calvin & Hobbes Engineering Inc.STAR::BRANDENBERGIntelligence - just a good party trick?Tue Mar 07 1989 10:3624
    
    I suppose my perception of a need for enforcing a policy comes from
    notes in other conferences with references to "mission critical" X
    applications.  Specifically, both NASA and ESA are looking at using X
    in their manned space programs with ESA actually using it for spacecraft
    instrumentation.  Since Joe Astronaut probably isn't going to start an
    Xtrek session during a flight, perhaps I'm overreacting, but it was the
    possibility of applications such as these, and a rather cavalier attitude
    about what constitutes sufficient testing in a system exhibiting stochastic
    behaviour, that prompted my original tirade around reply .28.
    
    There was a Calvin & Hobbes cartoon some years ago that went like this:
    
    Calvin:  Dad, how do they get the load limit for bridges?
    Dad:  Well, Calvin, they drive bigger and bigger trucks over it until
    		it breaks.  Then they rebuild the bridge and weigh the
    		last truck.
    Calvin:  Oh!  I should have guessed that!
    
    Unfortunately, this is exactly how software load limits are determined
    today.
    
    						monty

60.99Is anyone actually going to do anything?WINERY::ROSETue Mar 14 1989 12:377
    This is an interesting discussion. Sorry I missed most of it while on
    vacation... But is anyone taking an action item to try and get
    EventsLost added to the X protocol (for example, in the ANSI
    standardization process)?
    
    Re .97: By Felix Frankfurter I think you mean Potter Stewart.

60.100let the server do it!NEXUS::B_WACKERFri Mar 24 1989 10:4953
Since xlostevent is so fraught with problems and so unlikely to make 
it past MIT, how about another approach?  Use the terminal driver model 
(.88) so the session manager knows of the problem before it is too 
late.  Create a modal message in the offending process's window that 
says something like "This process is a hog and you can either wait 
till it's eaten its fill or push the kill button (in the box) if you 
want to get rid of it."  Do all the previously suggested beeping, 
freezing of the keyboard, skull and crossbones, etc. to tell the user that 
no other input will be accepted for this window other than the kill 
button.

What about all the other windows?

In a first version you could just stall them, too.  Send a message to the 
clients to block until they get a message that the hog is satisfied or has 
been consciously killed by the user.

In a second version, you could make the server smart enough to watch for 
actions that generate messages to the hog.  If that happens, then stall 
the initiator and give it a box that says "waiting for the hog, kill 
or wait."  That way, if there's no cross-process communication going on, 
the hog is the only one to suffer.  You can still have real-time 
graphics output to the thermometer for your nuclear coolant!  You 
could still move a window that is partially obscured by the hog out 
from under it and do virtually everything where there's no geometry 
interaction with the hog.

Advantages I see:
1) The only new protocol is the stall to the client (which may already 
be there for sync??)

2) The user (not the application) is in control of whether or not the 
connection is terminated.

3) All the implementation is in the server so upward compatibility is
guaranteed.

4) A session manager option could enable this functionality or 
the current "tough s__t" approach.

5) It could be a very important differentiation feature between our 
server and other vendors' if MIT drags their heels.

6) You completely avoid the impossible problem of how to recover from 
lost events because you don't lose any.  (There really is no general 
solution to this problem, is there?)

7) The user is in control.

8) The user is in control.

Bruce
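
To make the second-version policy concrete, here is the shape of the decision
the server would make, in pseudo-C.  Every name below is hypothetical -- this
is not server source, just the policy from this reply written out.

    #include <X11/Xlib.h>

    typedef struct Client Client;                 /* opaque, hypothetical */

    extern int  output_blocked(Client *c);        /* transport jammed?    */
    extern void post_wait_or_kill_box(Client *c); /* let the user decide  */
    extern void stall(Client *initiator);
    extern void deliver(Client *c, XEvent *ev);

    /* Route an event to `dest'; `initiator' is the client whose action
     * caused it, or NULL for plain user input.                          */
    void route_event(Client *dest, Client *initiator, XEvent *ev)
    {
        if (!output_blocked(dest)) {
            deliver(dest, ev);                    /* normal case          */
            return;
        }
        post_wait_or_kill_box(dest);              /* mark the hog         */
        if (initiator != NULL)
            stall(initiator);                     /* only clients that talk
                                                     to the hog are held  */
    }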

60.101KONING::KONINGNI1D @FN42eqFri Mar 24 1989 11:1715
I don't see how that can work.  Some events are indeed caused by user input,
and you could perhaps block those.  But a lot of other events come from the
actions of other clients -- for example, expose events occur if another
window is moved, resized, deleted, or iconized.  You can't block those 
operations because then you would affect other clients.  (If you think it's
ok to affect other clients, you might as well just halt the system when this
problem occurs.)  If you can't prevent other clients from doing the things
that generate these events, then the only alternative, given that you
have no place to store the events, is to discard them and let the affected
client know that this happened.

What's the big deal?  This is elementary stuff in distributed systems design.

	paul

60.102DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Fri Mar 24 1989 13:1211
Personally, I think the first thing to do is to reduce (but not eliminate)
the problem by giving the server the capability of saving the event away
and trying again later while continuing to process requests.  Obviously this
can't go on forever; the server runs out of memory if a client sits
around idle long enough.  Still, it goes a long way toward alleviating
the short-term problem.  In the long term, I think you have to tell a client
that it has lost something, but we want to minimize the frequency with which
we have to do this.

Burns
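
A sketch of the "save it away and retry" idea with the obvious memory cap.
The types and the limit are made up; the point is only that the queue is
bounded, and hitting the bound is where a LostEvents event (or today's
disconnect) would still be needed.

    #include <stdlib.h>
    #include <string.h>

    #define PENDING_LIMIT (64 * 1024)        /* arbitrary per-client cap */

    typedef struct {
        char   *buf;                         /* wire-format events        */
        size_t  used;
    } PendingQueue;

    /* 0 = buffered for a later retry; -1 = quota exhausted, fall back
     * to something stronger.                                            */
    int queue_event(PendingQueue *q, const void *wire, size_t len)
    {
        char *p;

        if (q->used + len > PENDING_LIMIT)
            return -1;
        p = realloc(q->buf, q->used + len);
        if (p == NULL)
            return -1;
        memcpy(p + q->used, wire, len);
        q->buf  = p;
        q->used += len;
        return 0;
    }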

60.103let the user decideNEXUS::B_WACKERFri Mar 24 1989 15:3320
>(If you think it's ok to affect other clients, you might as well just
>halt the system when this problem occurs.)  If you can't prevent other
>clients from doing the things that generate these events, then the
>only alternative, given that you have no place to store the events, is
>to discard them and let the affected client know that this happened. 

You only affect the clients that are muddying the waters of the one 
who's run out of resources.  True that could escalate, but the USER
could still abort if it is the "wrong thing".  A bad apple in the 
barrel affects everyone sooner or later.

>What's the big deal?  This is elementary stuff in distributed systems
>design.

Doesn't that imply a design where the server is capable of restoring 
all the lost context, or one where the client has a copy of the server 
database?  Neither of those obtains here.

Bruce

60.104KONING::KONINGNI1D @FN42eqMon Mar 27 1989 13:1810
Resource issues caused by the fact that process P is not running as fast as
it needs to should be confined to process P, and should not affect other
processes.  That's what I was pushing for.  The fact that the other processes
are doing things for the same user is irrelevant.

re .102: I agree, reducing the incidence of the problem is a good first step
while we wait for the real solution, if and when it actually comes to pass.

	paul

60.105Why are we still arguing this?IO::MCCARTNEYJames T. McCartney III - DTN 381-2244 ZK02-2/N24Mon Mar 27 1989 20:0914
>>> A bad apple in the barrel affects everyone sooner or later.

If my application that went off compute-bound were some critical life-support 
or fail-safe control mechanism, I'd sure hate for my display server to decide
that it "wasn't playing by the rules" and disconnect it.  The point is really 
simple: you can't stop all the different sources from which events can be 
generated, you can only hope to catch them all.  When you can't, you must do
something reasonable.  The lost-events event is the classical way to handle this
type of flow-control problem.  It's not perfect, but at least the failure modes
are such that you can recover. 

James

60.106CVG::PETTENGILLmulpTue Mar 28 1989 15:456
Here's a variation on the flow control problem.  Try xlsfonts with the full
info option and watch as the server `hangs'.  As the `man page' for it says
under `bugs', this is a problem with the single-threaded server design.

ELKTRA::DW_EXAMPLES note 116
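
If you'd rather see it from a program than from xlsfonts, the request behind
the full-info listing is XListFontsWithInfo (a real Xlib call).  One request
that makes the server walk every matching font; while the single-threaded
server grinds through that, nobody else gets service.

    #include <stdio.h>
    #include <X11/Xlib.h>

    void list_all_fonts(Display *dpy)
    {
        int          count;
        XFontStruct *info;
        char       **names = XListFontsWithInfo(dpy, "*", 10000,
                                                 &count, &info);
        int          i;

        if (names == NULL)
            return;
        for (i = 0; i < count; i++)
            printf("%s\n", names[i]);
        XFreeFontInfo(names, info, count);
    }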

60.107ULTRA::WRAYJohn Wray, Secure Systems DevelopmentWed Jul 26 1989 15:593
    Any news on this issue?  Are the MIT people looking at it, or have they
    defined it to be a non-problem?

60.108DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Thu Jul 27 1989 13:2411
I talked to Bob Scheifler about it.  He believes it is a non-problem.  Monty
Brandenberg was going to make a proposal for fixing it.  However, he decided
to leave the company and take up consulting before he could get to it.

Version 2 of DECwindows relieves the problem to a large extent by doing
additional buffering with DECnet.  As has been discussed before, this does
not truly solve the problem, but it does reduce the cases where we see it.
In fact, I have not seen it at all since this happened.

Burns

60.109ULTRA::WRAYJohn Wray, Secure Systems DevelopmentSat Feb 03 1990 16:0518
    I don't understand how he can view it as a non-problem.  Without
    application-level flow-control (and lost-event handling) of some sort,
    I can write a non-privileged application which can cause other
    applications sharing the same display server to crash.  Bugs in one
    application can cause other applications to crash.  Glitches on the
    network which tear down the transport connection can cause applications
    to crash.  I've just demonstrated that a user with a quick mouse finger
    can kill random applications (although it is true that this is more
    difficult now than it was under VMS DECwindows V1).
    
    All this boils down to "X, as defined at present, is inherently
    unreliable", which seems to mean that it is unsuitable for most
    process-control applications.
    
    Or am I missing something?
    
    Is there any record of Monty's proposed fix?  Is it being followed up
    by anyone else within Digital?
60.110A voice from the past coming back to haunt? DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Mon Feb 05 1990 13:015
He never wrote anything down.  No, it is not being pursued.  It is very hard
to pursue a theoretical problem when there are millions of problems that
customers see (and complain about) every day which are waiting to be solved.

I agree...it is not fixed or solved.  However...