
Conference bulova::decw_jan-89_to_nov-90

Title:DECWINDOWS 26-JAN-89 to 29-NOV-90
Notice:See 1639.0 for VMS V5.3 kit; 2043.0 for 5.4 IFT kit
Moderator:STAR::VATNE
Created:Mon Oct 30 1989
Last Modified:Mon Dec 31 1990
Last Successful Update:Fri Jun 06 1997
Number of topics:3726
Total number of notes:19516

60.0. "SCS as transport for X?" by ANTPOL::PRUSS (Dr. Velocity) Sun Jan 29 1989 10:36

    We have the ongoing discussion on TCP/IP transport; how about another
    transport question.
    
    Could SCS be used as a transport for nodes within a cluster?  Is
    the SCS protocol amenable to being used in this fashion?  Is there
    any reason to believe it would offer better performance than DECnet?
    
    -fjp

T.R    Title    User    Personal Name    Date    Lines
60.1STAR::KLEINSORGEshockwave riderSun Jan 29 1989 14:5010
    
    Hmmm.  Which workstation has a CI?  The only one I know of that
    "could" use SCS effectively would be the VS8000, but it doesn't
    support a CI adapter for its BI.
    
    A more interesting idea might be using LAT as the transport, it's
    simple, small and fast.
    
    

60.2This window manager is confused.ANTPOL::PRUSSDr. VelocitySun Jan 29 1989 20:056
    I thought we used SCS on the Ethernet in an NI/MI Vc.  There aren't
    enough slots in a VS8000 for a CI, but that would be an interesting
    tangent!
    
    -fjp

60.3Not really!!SKRAM::SCHELLWorking it out...Sun Jan 29 1989 22:0316
>    
>    Hmmm.  Which workstation has a CI?  The only one I know of that
>    "could" use SCS effectively would be the VS8000, but it doesn't
>    support a CI adapter for its BI.
>    
>    A more interesting idea might be using LAT as the transport, it's
>    simple, small and fast.

	Whoa!!  SCS is not a CI only protocol.  SCS runs on LAVC's, using
	the Ethernet as a transport.

	I think the real question is if SCS is a better protocol than
	DECnet task-to-task???

Mark

60.4Forgive me, but I just finished reading all the TCP stuff...DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Mon Jan 30 1989 17:385
Oh great...you want us to "support" this too, or shall we just ship the image
for everyone to play with?

Burns

60.5MAXWIT::PRUSSDr. VelocityMon Jan 30 1989 18:3412
    What, you mean you have it working already and are holding out on
    us?! :-)
    
    Just an idle question for speculation, really.  But I kind of like
    the idea of sending stuff to a VAXstation 8000 from the
    VAX_THAT_IS_YET_TO_COME over the CI.  We know that SCS is much more
    efficient than DECnet FAL for file transfer.  I have no idea how it
    would compare to task-to-task for the X protocol.
    
    -fjp
    

60.6STAR::KLEINSORGEshockwave riderTue Jan 31 1989 00:0311
    
    My wife did a prototype of the "LAST" (a LAT derivative) driver that
    talked directly to the CI.  Don't remember the numbers offhand, it was
    *very*, *very* fast.
    
    And raw LAT is probably about as quick as you want over the ethernet
    (though DECnet isn't a slouch on the ethernet for just raw data
    communication according to her tests).
    
    

60.7Remember DECnet over CI?DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Tue Jan 31 1989 09:0810
Well, I can't say I know much about this, but remember when CI first appeared
a few years ago?  It was possible to run a DECnet circuit over the CI.  After
a while, though, it was determined that it was much more efficient to run DECnet
over the ether and SCS over the CI.  (Now of course we also have SCS running
over the ether as well).  This may prove nothing, except to show that there is
precedent for deciding that it was better to use ether than CI for one particular
class of communication protocol.

Burns

60.8LESLIE::LESLIEAndy Leslie, CSSE / VMSTue Jan 31 1989 13:003
    The reason it was inefficient and thus slow was that it still used
    DECnet!  Using native protocols would be much faster - and is!

60.9STAR::SNAMANSandy Snaman, VMS DevelopmentWed Feb 01 1989 11:1912
    Re .7:

    Regarding the old wisdom of running DECnet over the Ethernet rather than
    on the CI.  Some recent performance testing has shown that this has
    been a myth for some time.

    The advent of processors faster than a 780 made it possible to do
    substantially better using DECnet on the CI than on the Ethernet.

    


60.10KONING::KONINGNI1D @FN42eqWed Feb 01 1989 18:1213
I think the reason there is an NISCS isn't because it's faster (inherently)
than DECnet, but because it was the way the VAXclusters software could be
made to run on an NI.  So it's questionable whether that would be any
better than DECnet to talk X to workstations.

As for LAT, remember that LAT is a request-response asymmetric protocol
optimized for the character-at-a-time interactive exchanges of dumb terminals.
X uses a very different sort of data flow pattern (pipelined rather than
request-response) and is unlikely to run as well, let alone better, on LAT
than on DECnet.

	paul

60.11I doubt that an SCS transport on the Ethernet would be much different than DECnetSTAR::BECKPaul BeckWed Feb 01 1989 20:289
Paul K is correct in .10 as to the rationale for NISCS. There is relatively
little difference between well-optimized DECnet performance on the Ethernet
and equivalent performance using NISCS. The evidence for this is in the
performance figures of DFS, which uses DECnet, but which comes quite close
to LAVc performance on Ethernet. The numbers aren't identical, but then they're
not doing exactly the same things once they get off the wire. (Comparing LAVc
with DAP will not produce a favorable comparison for DAP, on the other hand.)


60.12Then why is so much effort being put into a DECwindows transport using LAT?IO::MCCARTNEYJames T. McCartney III - DTN 381-2244 ZK02-2/N24Fri Feb 03 1989 16:287
An un-announced product to come out of DSG is planning to use LAT to transport
the X-wire protocol. If this is not such a good idea, then what needs to be 
done to get them to change their implementation strategy?

James

60.13***sigh***KONING::KONINGNI1D @FN42eqFri Feb 03 1989 17:035
We who have been trying to change that approach have been wondering about
that as well.  So far nothing has worked.

	paul

60.14RAMBLR::MORONEYBetter to burn out than it is to rust...Fri Feb 03 1989 22:0111
I would suggest using a separate Ethernet protocol for DECwindows transport,
rather than trying to lay it on LAT or SCS.  This way the driver code, packet
formats, etc. can be optimized for the type of traffic expected.  I'd guess
that Windows on ethernet SCS would probably do OK, but on LAT would be poor
since, as mentioned, LAT is optimized more for single-character traffic.

'Windows seems to be a big enough part of DEC's future that it should deserve
its own Ethernet protocol.

-Mike

60.15MIPSBX::thomasThe Code WarriorSat Feb 04 1989 00:249
A good implementation of NSP serves quite nicely as a transport for the X
protocol.  Since the X protocol consistently generates bidirectional traffic
all data ACKs tend to be piggybacked.  Thus almost all the traffic tends
to be X packets with very little overhead.

Note: VMS DECwindows users may want to raise their workstation's pipeline
quota to 8K or more to allow DECnet-VMS to use delayed ACKs more frequently.


60.16maybe already being doneATLAST::BOUKNIGHTW. Jack BouknightSat Feb 04 1989 17:346
    re: .15, VMS DECwindows startup already checks for and SETs DECnet
    EXEC PIPELINE QUOTA to 10000.  I assume that's the parameter you
    were recommending be changed.
    
    Jack

60.17KONING::KONINGNI1D @FN42eqMon Feb 06 1989 12:2713
Re .14: just because there is a big market for something doesn't mean that
it should have a protocol of its own.  In fact, just the opposite is true:
by using standard protocols, you make the product even more attractive.

That's doubly true since, as was mentioned, X runs well over DECnet and
there is no reason to believe that it will run substantially better over
any other transport.  Besides, developing additional transports is 
expensive, counterstrategic, etc.  It prevents things from running over
wide area networks, gives a "we don't care about standards" message, and
so on.

	paul

60.18VISA::BIJAOUITomorrow Never KnowsTue Feb 07 1989 02:4220
>That's doubly true since, as was mentioned, X runs well over DECnet and
>there is no reason to believe that it will run substantially better over
>any other transport.  Besides, developing additional transports is 
    
    I'm feeling a bit doubtful about this statement. So far, we have had a
    number of problems using X over DECnet, links being lost and so on,
    and I'm sure a LAT (ethernet, whatever you want) based transport would
    be an excellent solution (especially for LAVc's).
    Internally, we are moving towards hidden areas to be able to connect 
    our VAXstations on the network. 
    DECnet phase V is too far away, and I believe we really need a LAT 
    (ethernet, whatever you want) transport. At least, something that 
    doesn't get stuck in a bottleneck that a L2 Router can be in such case.
    
    Anyway, have you ever gathered statistics (e.g. packet/sec) of DECnet 
    usage when using X across it ?
    
    
    Pierre.

60.19???PSW::WINALSKIPaul S. WinalskiTue Feb 07 1989 14:1326
RE: .-1

>    DECnet phase V is too far away, and I believe we really need a LAT 
>    (ethernet, whatever you want) transport. At least, something that 
>    doesn't get stuck in a bottleneck that a L2 Router can be in such case.

I don't understand this.  If your VAXstation is plugged into an ethernet,
the DECnet runs over it.  Stations on the same ethernet can talk directly
to each other without involving any routing node whatsoever, let alone an L2
router, if the stations are in the same area.

>    I'm feeling a bit doubtful about this statement. So far, we have had a
>    number of problems using X over DECnet, links being lost and so on,
>    and I'm sure a LAT (ethernet, whatever you want) based transport would
>    be an excellent solution (especially for LAVc's).

If you are talking about LAVc, then all of the nodes MUST be in the same
DECnet area, and they all must be on an ethernet.  DECnet works just fine
without involving any routing nodes in these circumstances, and it uses
ethernet.  Assuming that your ethernet hardware is configured properly, the
only case where you should be seeing logical links broken is when one or
the other machine goes down, and there is no preventing that.  I don't
understand your problem here.

--PSW

60.20STAR::KLEINSORGEToys 'R' UsTue Feb 07 1989 14:4821
    
    Paul, it's a common perception that LAT often works better than
    DECnet on the ethernet, especially if you've ever been on one
    of the segments in ZK.  I often have two machines next to each
    other that refuse to see each other, and CTERM over a couple of
    bridges in this building can be a hazard.  On the otherhand, I
    quit using SET HOST a long time ago because LAT proved so much
    more reliable (hence the perception) and often I get pissed
    when a copy tells me that my node isn't reachable when I'm
    VWSLATed from my node at the time I get the message.
    
    It may be that the difference is that DECnet is picky about
    making sure that the data actually gets there and gets there
    correctly, while LAT assumes everything is peachy and has much
    less error checking and looser "tolerances" (an error!? hey, let's
    send it again...).
    
    Anyway, to a typical "user", LAT usually looks more reliable.
    
    

60.21!!!VISA::BIJAOUITomorrow Never KnowsTue Feb 07 1989 14:5968
    Re: .19
    

>I don't understand this.  If your VAXstation is plugged into an ethernet,
>...
    
    No. Not if the VAXstation is in a hidden area (which is, for our case,
    area 63). In the area 51 (the regular one), we have an L2 router which
    talks to another L2 router which stands in area 63. The path for a
    packet from a satellite to the boot node (which stands in area 51,
    because we need access to the WAN) is then thru the two L2 routers.
    I believe you can get more info in the notesfile IAMOK::HIDDEN_AREAS.
    
    
>If you are talking about LAVc, then all of the nodes MUST be in the same
>DECnet area, and they all must be on an ethernet.  DECnet works just fine
    
    No, the nodes aren't in the same area, but they are on the same LAN.
    Although they are on the same LAN, they have to go through the two L2
    routers for *DECnet* communications (but not for LAT or SCS
    communication).
    
    
>ethernet.  Assuming that your ethernet hardware is configured properly, the
>only case where you should be seeing logical links broken is when one or
>the other machine goes down, and there is no preventing that.  I don't
    
    No, we have had cases where links were lost without having one node or
    the other being down. It's just that the DECwindows server just can't
    cope with the buffers (as I understood it).
    
    Note #293.0 in the notesfile HANNAH::DECW$DISK:[PUBLIC]DECTERM describes 
    the problem in more detail. I quote without permission some of the content of the
    note. You can go to the notesfile to get the exact context, for better
    accuracy.

    
>  Occasionally when the server has replies and events to write to a client 
>  and network output buffers are unavailable to perform the write operation, 
>  the current server would attempt the same write for a number of times
>  prior to disconnecting the non-responsive client.

>  In the duration of the retries, the server would not serve any other
>  client, and to the user, it would appear that the server is hung.

    As you can see the server will hang, but sometimes, I believe when
    time-out occurs, the server just gives up and drops everything on the
    floor.
    
    As a fix for the moment, we raised the Maximum buffer parameter in the
    boot node exec (from 100 to 200) and the pipeline quota. And wait and
    see.
    
    In our area (51), we should run out of numbers in a couple of months. 
    What will happen to the dozen VAXstations I have ordered?  Run them
    standalone, out of the network ? Naah, everybody needs the net, so we
    just got to squeeze our elbows, waiting for DECnet phase V that should
    (as I understood it) solve the limitation of 64 areas and 1023 nodes
    per area, and use the concept of hidden areas.
    
    There may be other concepts, but I ain't a specialist in this area,
    IAMOK::HIDDEN_AREAS covers more of the problem.
    
    
    (sigh) C'est la vie !
    
    Pierre.

60.22KONING::KONINGNI1D @FN42eqTue Feb 07 1989 16:3718
There are definitely some misunderstandings about DECnet going around here,
which isn't helping the signal to noise ratio.

It does NOT matter whether your areas are the same, different, hidden, or not.
If you're going from one endnode to another on the same Ethernet, then
traffic will go direct (after a few initial packets).  If the host is a
router, then things aren't always that efficient, but then again if you
run routing on your hosts things are slower anyway.

As for DECnet being flaky, there may be some resource allocation problems, 
bugs, or whatnot.  Certainly things can get bad when some of the routers
in the area are inadequate (e.g., 750s or worse).  There is nothing in the
architecture that makes DECnet any more or less reliable, as far as 
links staying up is concerned, than LAT.  Certainly there is no such
issue as "less error checking" or "looser tolerances".

	paul

60.23PSW::WINALSKIPaul S. WinalskiTue Feb 07 1989 16:5427
RE: .21

You are assigning the blame for lost client/server communication to the wrong
place.  The DECnet logical link remains intact--the problem is that the X
server is single-threaded and times out client applications on its own,
independent of the state of the DECnet logical link.  This is a bug in our
current X server implementation and is independent of the protocol used to
provide the client/server transport.  Switching to SCS or LAT would not solve
the problem--the X server would still run out of buffers and you'd still
be disconnected.  This is a problem that should be fixed where the problem
occurs--in the server.

As far as hidden areas go, we should not be making strategic product design
decisions (such as what protocols to use for X) on the basis of temporary
configuration problems on our own internal network.


RE: LAT

No question about it--LAT performs magnificently for what it was designed to do,
which is to package single-byte transmissions on multiple virtual circuits into
a single ethernet message between a terminal server and its client CPU.  It
is better than CTERM at this.  However, X is a message-passing protocol, and
I question whether LAT would work as well as DECnet or SCS.

--PSW

60.24VISA::BIJAOUITomorrow Never KnowsWed Feb 08 1989 03:1934
    Re: .22
    Well, believe it or not, our DECrouter2000's (which are the most powerful
    L2 routers at the moment, correct me if I am wrong) do see the packets we
    are sending from one workstation to the boot node. 
    As well, how should I consider the Appendix A, paragraph A.6, page
    A-16, of the Networking Manual ? Have they got it wrong ?
    
    
    Re:.23
    From my user's point of view, what I see is a *lost* DECnet link.
    Whether it's DECnet or a server or a client doesn't matter to me. The
    link is lost, my work is lost.
    I'm glad you've found the bug, I'm sure it will be fixed for a future
    release of DECwindows. If, on top of that, it suppresses the occasional 
    hangs I have on my VAXstation, then perfecto.
    
>As far as hidden areas go, we should not be making strategic product design
>decisions (such as what protocols to use for X) on the basis of temporary
>configuration problems on our own internal network.
    
    I definitely agree. But I didn't imagine that adding another transport
    to the set of DECwindows' transports could be a "strategic product
    design". 
    
    
    Nevertheless, I will ask again my question: 
    Has anybody ever measured the packet rate per second (for instance) over
    DECnet that DECwindows generates from a local application to a remote
    display ? Any statistics produced ? Any performance tests ?
    
    
    
    Pierre.

60.25Area 51 speakingCASEE::LACROIXNo futureWed Feb 08 1989 03:4023
    Re previous:

    I'm in area 51 too... We have lots of workstations on a private
    Ethernet segment, and we were running into this problem of DECnet links
    being dropped on the floor (yes, it could be the X server timing out of
    its own). Gurus in CASEE came up with a very successful hack a couple of
    months ago: basically, whenever a workstation was rebooted, NETACP on
    the boot member was paging like crazy, going through the entire net
    database, looking for info on the workstation. That, plus the MOM
    process and a too small working set for NETACP was causing *ALL* X
    connections between the boot member and other workstations to be
    aborted. The fix is to use an area number small enough to cut down on
    NETACP's paging rate: area 1. Our boot member now thinks our
    workstations are in area 1, and thus finds info on what it should do
    with our workstation turbo fast. No more paging, no more links dropped
    on the floor, no more 10-second cluster transitions, etc...

    Incidentally, folks we talk to in the States were not very receptive
    to the problems we were having; I suspect this is related to the fact
    that you have a smaller problem when all your satellites are in area 3.

    Denis.

60.26STAR::BRANDENBERGIntelligence - just a good party trick?Wed Feb 08 1989 09:5420
    re: various
    
    What PSW said about the location of the problem is absolutely correct. 
    The problem begins with a poorly designed protocol, is aggravated by
    the VMS interface to DECnet, was only partially corrected by the
    transport, and a last-chance keep-alive effort was made in the server. 
    There is work-in-progress to make future versions better.  What can you
    do now?  Use tcp/ip.  Yes, even for vms-to-vms connections.
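
    A minimal sketch of what that looks like from the client side, assuming
    the standard Xlib display-string convention applies (one colon selects
    the TCP/IP transport, two colons select DECnet); the host name "myhost"
    is only an example:

        #include <stdio.h>
        #include <X11/Xlib.h>

        int main(void)
        {
            /* "myhost::0.0" (two colons) would select the DECnet transport */
            Display *dpy = XOpenDisplay("myhost:0.0");   /* TCP/IP */

            if (dpy == NULL) {
                fprintf(stderr, "cannot open display\n");
                return 1;
            }
            XCloseDisplay(dpy);
            return 0;
        }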
    
    Lat?  It's being looked at but what paul suggested may be true.  A
    protocol may or may not save you.  Lat works nicely when there are many
    data sources mapped to many data sinks but what will happen when there
    is a *single* data sink (a server)?
    
    As for network load statistics, a test suite has been created and
    numbers have been collected.  A report is being written (I haven't seen
    it yet).  It should be interesting.
    
    						monty

60.27KONING::KONINGNI1D @FN42eqWed Feb 08 1989 11:198
Re the problem of NETACP taking so much time on downline load requests:
that certainly is a problem.  It has been known for years.  There are
various obvious solutions that haven't been implemented.  However, none
of that has ANYTHING to do with the issue of which transport is appropriate
for X.

	paul

60.28Technical reasons for protocol problems?WINERY::ROSEWed Feb 08 1989 14:456
    Re .26: "The problem begins with a poorly designed protocol..."        
    
    I realize this is kind of complicated, but could you please elaborate?
    (This is not an argument, but I am just very curious because when
    reading over the X protocol I did not see anything particularly wrong.) 

60.29You'd think they'd learn after a whilePRNSYS::LOMICKAJJeff LomickaThu Feb 09 1989 12:546
It seems like the modern equivalent of assuming all computer terminals
will operate at 38.4KB continuously without the use of xon/xoff...

Figures, considering the source.


60.30DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Thu Feb 09 1989 15:305
A couple of notes here were hidden pending a discussion among the moderators.
We got a complaint.

Burns

60.31Wait a minute...CIM::KAIRYSMichael KairysThu Feb 09 1989 15:4221
    I would like to complain in the reverse direction. I was fortunate to
    have read note .29 just minutes ago, prior to its being set hidden. I
    believe I can guess what prompted the impulse to hide it. 
    
    However, I think the note presented information and a point of view
    that is important and needs to be aired. I think .29 should be used to
    start a discussion about real-world requirements which may (and should)
    lead to those requirements being addressed. My area of concern is
    discrete manufacturing; perhaps not as "critical" in some senses as
    nuclear engineering but nonetheless an area which demands dependable
    delivery of information and needs windowing technology.
    
    Perhaps the note could be slightly edited, if someone insists, and 
    returned to view. Personally it didn't seem inflammatory to me, but I'm
    from Ann Arbor...
    
    BTW, I also think note .31 presents a point of view about the history
    of X that is worth (re?)stating. 
    
    -- A Concerned Citizen

60.32DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Thu Feb 09 1989 16:577
There was not an "impulse" to hide it.  Someone (not from VMS development, I
might add) was concerned about aspects other than inflammation.
Please let it go at that for the moment.  I did not say this was the final
word.  That is what "hide" is for as opposed to "delete".

Burns, unfortunately a moderator

60.33Odd that LAT, not SCS, is the main topic when LAT was in another noteCVG::PETTENGILLmulpThu Feb 09 1989 19:2529
re: .23

>No question about it--LAT performs magnificently for what it was designed to do,
>which is to package single-byte transmissions on multiple virtual circuits into
>a single ethernet message between a terminal server and its client CPU.

The above statement is about `1/3 true'.

Bruce Mann usually talks about his experience developing network applications
(based on DECnet) when talking about the goals he had for LAT.  He wanted a
fast (ie., low in network and CPU overhead), fast (just in case you missed it
before), simple (ie., something that didn't take an army of programmers and
managers), simple (ie., something that one person could do and that would be
implemented widely), LAN transport.  Most of the work that Bruce was doing was
realtime data acquisition, but terminal character echoing is best if it is in
realtime, so terminal I/O is very applicable.  LAT is NOT Local Area Terminal;
LAT is Local Area TRANSPORT.

LAT and SCS have a number of things in common (Bruce was involved in the
architecture of both):  They both multiple multiple sessions over a single
virtual circuit and they both plug into the applications in the kernel rather
in user mode.  While these points make interfacing them to the system more
difficult, there is usually a payoff in terms of performance.

LAT was always intended to be a multipurpose tool for supporting specialized
LAN applications.  X was intended to be a LAN application.  Depending on how
users use X, LAT+X may be a real winner.  If X replaces ASCII, as it does with
an X terminal, then the use will be right as far as I can tell.

60.34Regarding the ProtocolSTAR::BRANDENBERGIntelligence - just a good party trick?Fri Feb 10 1989 12:3693
    re .28:  Yes, you did, it's practically on page one but it's so huge,
    no one seems to notice.  Consider typical client/server operation:  the
    client sends asynchronous requests to the server while the server sends
    asynchronous events to the client (say resulting from mouse motion or
    window reconfigure).  Only occasionally do the server and client come
    together and synchronize their communications with a request/reply
    pair.
    
    What does this mean?  It means that the only thing that keeps a
    client/server connection running is the buffering capability of the
    underlying transport implementation.  A server in the throes of
    generating motion events or window reconfigure events will run through
    code that commits the server to sending events to at least one and
    sometimes many client connections.  When this happens, the buffering
    capacity had better become available soon or the server will wait
    until it does.  The way the user's data is buffered on Berkeley-style
    networking implementations, it often is available.  But, say with a
    record-oriented interface and quota scheme as with the VMS interface to
    DECnet, it would almost never be available without additional work by
    the application.  (This is one of the intended functions of the common
    transport image on VMS.)
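
    A minimal sketch (not the actual server code) of why a committed event
    write is dangerous; the write() here may stall the single-threaded
    server whenever the transport's buffering is exhausted, which is exactly
    the situation described above:

        #include <unistd.h>
        #include <errno.h>

        /* Send one 32-byte event on a *blocking* connection.  Every other
         * client waits while this loop waits. */
        int send_event_blocking(int fd, const char event[32])
        {
            size_t done = 0;
            while (done < 32) {
                ssize_t n = write(fd, event + done, 32 - done); /* may block */
                if (n < 0) {
                    if (errno == EINTR)
                        continue;
                    return -1;               /* connection is broken */
                }
                done += (size_t)n;
            }
            return 0;
        }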
    
    "Well, then it's a VMS problem, isn't it?"  No.  I'm the first to admit
    that the VMS interfaces are often inconvenient for getting work done
    but in this case, they merely exaggerated a problem with the protocol
    they did not create it.  About two years ago, after finishing one of
    the first ports of the server to VMS, we experienced frequent deadlocks
    due to this problem (I should say we experienced infrequent successful
    operation).  I poked around, looked at the system, looked at the design
    and said, "Look, this protocol is a deadlocking protocol."  I received
    very little indication that anyone understood the problem or that they
    anyone was interested.  At this point, in my opinion, we should have
    worked on the server semantics, or changed the protocol, or...
    something but it didn't happen.  
    
    "Well, um, in R3 they fixed xlib to keep reading from the server if it
    can't write requests."  Yes.  On Unix.  But is that enough?  Must the
    operating system provide the means of recovery from a bad protocol? 
    Should a "reliable, production-quality, bullet-proof" server rely on
    the good behaviour of its clients to ensure that it continues to
    execute?  Should it rely on the stability and predictability of a
    network populated with LAVC's, NFS-served systems, diskless systems,
    gateways, bridges, etc.?  Should it rely on some unknown operating
    system scheduling its clients so that it can continue operation?  These
    are the sorts of questions one must ask when designing a reliable,
    distributed system.  Answers are even better but I don't have any
    that are clear and absolute.  How about some scenarios?  Here are some
    possibilities which I can imagine (though they may not exist in fact).
    And yes, they're pathological but they are intended as illustrations
    to encourage discussion of the technology.

    1)  A standalone workstation whose user has a few xterms, a
    wmohc (window manager of his choice), a clock, etc.  He runs an X
    application which creates windows, does some work, and interrupts it
    leaving it around but not running.  He goes on to do other things like
    pop windows and drag his mouse around.  All of a sudden, his
    workstation hangs while the server tries to send some events to a
    client that isn't running.  How do you recover?

    2)  A workstation on a network has a client from a diskless workstation.
    The link gets a bit behind while the client tries to write some requests
    so it, being an R3 system, dutifully tries to read from the server.  But
    the code that reads takes a page fault and the NFS server has just crashed. 
    Three seconds later, the X server wants to tell this client about the
    180 motion events that have occurred and so it hangs.  All because of a
    nfs server *two hops away*.
    
    "But, I've been programming on X workstations for years and it's
    usually worked for me!"  Well, so what?  Is this proof by example? 
    Let's be honest with ourselves:  the primary use of X systems up to
    this point has been as programmer's workstations, to develop
    programmer's tools, all to help programmers.  Only now is it moving out
    into non-programming and non-engineering tasks.  I hope I'm not
    bursting anyone's bubble with this proposition but, in my opinion, the
    standards of reliabilty and quality to which programmers in the world
    at large hold themselves *do not* compare favoribly with those in most
    other engineering and non-engineering activities.  By analogy,
    programming is to, say, civil engineering what astrology is to
    astronomy or what numerology is to mathematics.  Consider:  a power
    company might investigate using a workstation to display the operating
    status of a fission reactor.  Or medical equipment companies who'll
    make instruments to monitor patients in surgery.  Or manufacturers
    who desire to control time- and position-critical processes in a
    steel mill.  When one builds a skyscraper, it is anchored in bedrock
    not in mud.  I believe that this good-enough-for-programmers-so-it's-
    good-enough-for-everybody attitude is *unacceptable* when the products
    of these programmers are actually used by the rest of the world.
    
    In taking this opportunity for a little bombastic opinion, I hope I was
    able to adequately describe the protocol deficiency as I understand it.
    
    					monty

60.35re: .30STAR::BRANDENBERGIntelligence - just a good party trick?Fri Feb 10 1989 12:3714
    (I've been stewing for two years but I'm feeling better now.)
    
    Yes, I'm not too happy with the design but I can be fair.  The "Boys
    From Cambridge" didn't set out to solve the world's display problems
    so many years ago (at least by my understanding).  They created a
    system that was built for programmers and students and it may be
    adequate for that purpose.  I certainly like to use the tools and the
    environment for my work (programming).  But first by accident and
    then by *executive decree*, it was decided to make a commercial system
    out of this that would solve everybody's needs.  I am personally
    uncomfortable with the way in which these decisions were made.
    
    					monty

60.36Well, look at where the market is putting its moneyPOOL::HALLYBThe smart money was on GoliathFri Feb 10 1989 16:2912
    Nor is this the first example of the marketplace demanding an inferior
    product.  Your PC (Apple or IBM) crashes?  Oh, well, reboot it and get
    on with things.
    
    These kinds of problems are seen to be like cars stalling then starting.
    No big deal, it costs too much to engineer perfection.
    
    Nuclear reactor operations?  We'll buy two.  They won't both fail just
    prior to meltdown.  Etc.
    
      John

60.37Slight time warp (Old noters: remember those?)DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Fri Feb 10 1989 16:483
For the record, .36 and .37  replace some notes which were deleted (29 and 31,
I think.)  That is why the context and order seem a bit funny.

60.38The inevitable follow-up questionsWINERY::ROSEFri Feb 10 1989 19:1315
    RE .36: Thank you, this is very interesting. Disclaimer: These are
    questions -- not arguments. I'm trying to understand your note, not
    rebut it. 
    
    Are you contending the following? It is impossible to write a server
    that does not hang if a client hangs and if enough events occur that
    are directed to that client. 
    
    You say this is even true over TCP/IP, just that the probability of
    hanging is much lower under TCP/IP on Ultrix? Is that because TCP/IP on
    Ultrix allows more stuff to be in the pipeline undelivered? 
    
    An even more general question: What is the simplest change to X that
    would make it possible to write a hang-free server? 

60.39The Ultrix server seems hang-proofFLUME::dikeSun Feb 12 1989 11:3810
I checked the Ultrix sources, and it doesn't look like the server is capable of
hanging.  The server sets up connections so that if a read or a write would
block, the call returns immediately with EWOULDBLOCK.  If the call was a read,
the server services other clients until the rest of the data comes through.  If
it was a write, the client is punted.
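
A rough sketch of that behaviour (not the Ultrix server source): the connection
is made non-blocking, a read that would block simply returns, and a write that
would block causes the client to be punted; punt_client() is a made-up helper,
not a real routine:

    #include <fcntl.h>
    #include <unistd.h>
    #include <errno.h>

    extern void punt_client(int fd);   /* hypothetical: close and clean up */

    void make_nonblocking(int fd)
    {
        int flags = fcntl(fd, F_GETFL, 0);
        (void)fcntl(fd, F_SETFL, flags | O_NONBLOCK);  /* FNDELAY on old BSD */
    }

    int write_or_punt(int fd, const void *buf, size_t len)
    {
        ssize_t n = write(fd, buf, len);
        if (n < 0 && errno == EWOULDBLOCK) {
            punt_client(fd);           /* don't wait; drop the slow client */
            return -1;
        }
        return (int)n;
    }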

I don't intend to claim that anecdotal evidence amounts to proof, but I have
never heard of an X server on Ultrix hanging in a read or a write.
				Jeff

60.40The problem is not in managing the line but the resources consumed by the server.IO::MCCARTNEYJames T. McCartney III - DTN 381-2244 ZK02-2/N24Sun Feb 12 1989 18:1823
RE.: .41

Consider an application that enables mouse events, then promptly goes off to
"sleep" (ignores making a call to get the next event). The process may actually
be doing something useful (like an FFT or Finite Element model). Meanwhile, the
impatient user is idly dragging the mouse around generating 1000's of events
per minute. The server, attempting to preserve these events, is packaging them
up as quickly as it can, shipping them out to the client. Eventually, the client's
network buffer fills, the network transport layer screams "No more..." and the
server has to decide to buffer it locally, or to drop things on the floor. 

Early servers attempted to do no buffering and simply aborted the link, causing
intrinsic reliability problems. I can't speak for the existing VMS and Ultrix
servers, having not seen the code, but I believe that this is one of the
problems to which Monty is referring.

In extremely severe cases, it is possible that the server will exhaust its 
resources trying to buffer events locally, and thus hang. Until the dormant 
program gets around to reading its event queue, nothing can be done on the
server. 

James

60.41%DECW-F-IPI-Insufficient programmer intelligence failure at ...IAGO::SCHOELLERWho's on first?Mon Feb 13 1989 10:169
re: .42

That is why we have been frequently reminded to not write programs that
disappear for a long time without checking the event queue.  A small amount
of intelligence on the part of the application developer prevents this
client from being punted.

Dick

60.42Look at what has come beforeSTAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 13 1989 11:3071
    
    re .40:  Is it possible to write a hang-free server?  If it is not
    acceptable to drop a connection at the first sign of a hang, then
    I believe it is impossible to write a *reliable*, hang-free server.
    In previous replies (to which I will respond shortly) note how recovery
    takes place:  if a server write blocks, drop the client.  Most
    low-level networking protocols implement some sort of quota system
    (windows or debit/credit or ... ) in the protocol itself.  The X
    protocol implements it in the operating system interface (if it doesn't
    fit, kill the connection).  This is one thing that must change if we
    are to have a reliable server.  There are at least two ways that this
    can happen:  either by changing the protocol and server semantics to
    include a debit/credit system for server-to-client communication or
    by changing them to allow unreliable delivery of events.
    
    I'll consider the latter first.  In certain areas, the X server has
    already made some movement in this direction.  With the realization of
    how large a load can be generated by mouse motion events, the designers
    created a "motion history buffer" in the server.  If we're generating
    events too quickly, and the client allows, put motion events in this
    buffer and report to the client, via events, that there is something
    interesting in the motion history buffer.  While this implementation is
    along the lines of an infinite buffer approach, look at what they're
    really doing:
    
    	1.  Server attempts to send report and fails (or might fail).
    	2.  Server stores state change (mouse motion).
    	3.  Server reports to client availability of state change
    		(motionHints is non-zero or whatever).
    	4.  Client synchronously requests report of state change
    		(getMotionHistoryBuffer).
    
    Generalizing this and changing the implementation, would give a server
    that doesn't *insist* on sending every single event and a chance at a
    reliable, hang-free server.
    
    Or, how about a debit/credit system?  Xlib could piggyback event credit
    values on requests.  An initial maximum could be inferred from the
    networking quotas and that particular networking interface.  This still
    implies a server that isn't required to send events or one that is able
    to encapsulate state changes to be sent later.  Another way is to
    change the communication model to something along the lines of an RPC.
    Asynchronous client requests could still be asynchronous but server
    state changes (i.e. events) would be acquired mostly synchronously. 
    There might still be an event credit to retain interactivity if it is
    shown to be necessary but Xlib calls such as XNextEvent would become
    request/reply pairs.  These are just some ideas, nothing has been tried
    but I think these are interesting avenues to pursue.
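
    A very small sketch to make the debit/credit idea concrete; the credit
    field, transport_send() and the per-client record are all assumptions,
    nothing like this exists in today's protocol:

        typedef struct {
            int      fd;
            unsigned event_credit;   /* events we are still allowed to send */
            int      state_pending;  /* set when an event had to be withheld */
        } ClientRec;

        extern void transport_send(int fd, const char *buf, unsigned len);
                                         /* hypothetical non-blocking send */

        /* A request arrived carrying a piggybacked credit value. */
        void grant_credit(ClientRec *c, unsigned credit)
        {
            c->event_credit += credit;
        }

        /* The server wants to report a state change to this client. */
        void post_event(ClientRec *c, const char event[32])
        {
            if (c->event_credit > 0) {
                c->event_credit--;
                transport_send(c->fd, event, 32);
            } else {
                c->state_pending = 1;  /* client fetches the state later,
                                          synchronously, as with motion hints */
            }
        }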
    
    As for the tcp/ip vs decnet and ultrix vs vms issues...  it's more a
    matter of programming interface than either base protocol or host
    operating system.  VMS tcp/ip (connection), Ultrix tcp/ip, and Ultrix
    DECnet all work pretty much the same:  they're byte-streamed,
    socket-derived interfaces that buffer user data in an almost pure byte
    limited fashion  (I believe, 4K per direction and per side is the
    default in all the above mentioned implementations.)  VMS DECnet, on
    the other hand, has a record-like quota system based on segments with a
    $QIO more-or-less generating at least one segment.  The byte-stream
    model allows a client and server to run skewed which is to some degree
    a requirement in *any* distributed system.  (Imagine trying to pipe
    some shell commands together if the byte-quota on a pipe was, say, one
    byte.)  Unfortunately, this model is also tolerant of protocol design
    failures.  Architectures which are intrinsically deadlocking appear to
    work simply because the deadlock condition is unlikely and the allowed
    response to a deadlock, if the interface allows it to be sensed, is to
    give up.
    
    Just some thoughts...
    
    						monty

60.43Oh, yes, an experiment.STAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 13 1989 11:3816
    
    re .40:  I've had an idea for an experiment for some time but I can't
    get the resources to perform it.  The idea was to get two ultrix
    machines on their own ethernet and setup an X test environment that
    would allow me to create an arbitrary cpu load on either a server or
    client machine.  I would then vary two variables, the load on a
    system (either server or client) and the mbuf quota for links, and
    observe and measure the reliability of various interactive
    applications.  My belief is that connections will become markedly
    unreliable as quota is dropped.  My contention is that there is no
    threshold at which a connection becomes reliable; that there is only a
    curve giving probability of failure which is never zero and which is a
    function of so many variables that we can never say "you're safe."
    
    					monty

60.44Hang-proof isn't the same as reliableSTAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 13 1989 11:488
    
    re .41:  You are absolutely correct in the ultrix case.  But first,
    only Unix has the nice FNDELAY option and must this be used to
    implement the protocol and server semantics?  And second, it doesn't
    hang but is it reliable?  Can't Joe Customer have both?
    
    					m

60.45STAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 13 1989 12:1636
    
    Re .43:  This is an extremely poor attitude to take.  I've already
    complained that X must rely on networking implementations to survive
    (an inappropriate mixing of levels) and now you're suggesting that the
    remaining slop be taken care of by the application programmer.  By the
    goodness in our hearts, we'll make this work?
    
    Truthfully, what justification is there for a call to
    ProcessInputEvents() in the outer loop of a 2D FFT?  Or an image
    convolution?  Or a large, atomic database transaction?  Or any of the
    other things that makes money for our customers?  I could argue from
    aesthetics (it's ugly), or structured programming paradigm (it's mixing
    levels), or from performance (it ruined the register optimizations), or
    from programmer convenience (they have to do everything), or from a
    quality assurance standpoint (more and more testing just to see if they
    can keep X alive).  And I claim it still isn't enough.
    
    The server can't control the application environment.  The application
    may be on another machine, on another operating system, in another
    country.  Well, neither can an application programmer completely
    control the application environment.  The programmer can't control when
    his process will be scheduled, can't control taking a page fault served
    by a crashed nfs server, can't control slow or overloaded or unreliable
    networks, etc. etc. etc.  The application programmer tries to get his
    algorithms correct and relies on the correctness of the system software
    to get the rest done.  Is the programmer's trust well placed?
    
    We are trying to create a reliable, distributed, interactive, graphical
    system.  (Those four adjectives are *very* important.)  I believe this
    is the single hardest networking problem anyone has yet seen.  It's
    more difficult than the base networking support (tcp/ip, udp/ip,
    decnet, whatever), rpc's, remote terminals, distributed filesystems,
    naming services, etc.  And I think it is not yet solved.
    
    						monty

60.46The future's not bright so take off your shadesSTAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 13 1989 12:2514
    
    Those who can begin to see the stochastic nature of these systems might
    think about the future.  The range of networking speeds is increasing. 
    Some people insist on serial line interfaces to X while others are
    preparing for FDDI and HSC.  The range of CPU speeds is increasing. 
    Two years ago, everything was pretty much one- to three-mips.  Servers,
    clients, pc's, routers, hp handheld calculators, etc.  Now we'll have
    Cray's, Connection Machines, Multiflow's, DAP's, MIPS boxes, SMP vaxes,
    on down to 68000-based X terminals.  This reliability curve I mentioned
    (really a reliability manifold) is dependent upon all these variables
    and others.  What is it going to look like in the future?
    
    					monty

60.47KONING::KONINGNI1D @FN42eqMon Feb 13 1989 12:335
Note that many of these would be non-problems if the operating systems we
use had decent multithreading facilities built-in.

	paul

60.48You've just moved the deadlockSTAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 13 1989 12:5217
    
    Re .49:  Do you mean for use by the server, one thread per connection? 
    If so, I think not (though others in VMS think it would be wonderful). 
    The problem is that clients intentionally and necessarily interact with
    one another.  They share real estate, keyboards, colormaps, etc. and
    when one client changes these, the others may need a report. 
    XSendEvent, properties, and selection encourage communication between
    clients.  And, because all these resources are shared, the database
    which maintains them is also shared.  And then there are clients which
    require atomicity across multiple X operations (such as the window
    manager) hence locking out other threads.  All this communication
    between clients implies locking, if a client needs a lock held by
    another client who is blocked by transport, that client will also
    block.  Conclusion:  server deadlocks can still occur.
    
    					monty

60.49Insufficient ArchitectureEVETPU::TANNENBAUMTPU DeveloperMon Feb 13 1989 13:2116
    Re: .43
    
    Yup, DECwindows requires that an application frequently check the input
    queue.  TPU had to jump through hoops to implement this.  And it's
    still not right.  I recently found that TPU's not checking the input
    queue while a subprocess is running (so don't do anything large in a
    subprocess and then wiggle your mouse on a DECwindows EVE window).
    
    How many other places have we missed, simply because no one considered
    yet another obscure area of the code?
    
    It would be a *LOT* easier if this was handled once, correctly, instead
    of trying to duplicate it in every application.
    
    	- Barry

60.50?WJG::GUINEAUMon Feb 13 1989 15:5813
Funny, my first use of X (DECwindows) was for an application that would
go off for more than 1 hour as a result of one mouse click. While it
was gone, the interface was dead! After a few contortions and mucho help
from this notes file, I got it all working by spreading ProcessXQueue();
calls all around the "work routine".
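
A minimal sketch of that workaround, assuming the work can be sliced into
passes; XPending() and XNextEvent() are standard Xlib, while do_one_pass()
and handle_event() just stand in for the application's own code (ProcessXQueue
presumably wraps a loop like the inner one here):

    #include <X11/Xlib.h>

    extern void do_one_pass(int pass);      /* one slice of the long job */
    extern void handle_event(XEvent *ev);   /* application event handling */

    void long_computation(Display *dpy, int passes)
    {
        int pass;
        for (pass = 0; pass < passes; pass++) {
            do_one_pass(pass);

            /* drain the queue so the server never backs up behind us */
            while (XPending(dpy) > 0) {
                XEvent ev;
                XNextEvent(dpy, &ev);
                handle_event(&ev);
            }
        }
    }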

I never suspected the far reaching implications this really had, but
figured there must be a better way (like have a separate thread do the X Queue 
Processing asynchronous to the rest of the application.)

John

60.51KONING::KONINGNI1D @FN42eqMon Feb 13 1989 17:5314
Right.  I was referring to the application, not the server.

On the server side, there has to be a better way too.  For example, events
could be discarded when there are too many pending transmission to a particular
client.  Such flow control would of course have to be on a per-client basis.
Then when the flow starts again, the client would receive a "you just lost
some events because you were too slow" event along with the subset of real
events that was kept.  (You may recognize this approach -- it's the one used
in DNA for event logging.)  It may or may not be appropriate for the server
to provide some feedback to the user (bell, or some such?) in addition to
the events-lost event that goes to the client.
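
A minimal sketch of that per-client policy; the limit, the queue routines and
the "events lost" notification are assumptions for illustration only:

    #define MAX_PENDING 64

    typedef struct {
        int fd;
        int pending;    /* events queued for this client, not yet written */
        int lost;       /* nonzero once something has been discarded */
    } Client;

    extern void enqueue_event(Client *c, const char ev[32]);   /* hypothetical */
    extern void enqueue_events_lost(Client *c);                /* hypothetical */

    /* Called whenever the server has an event for this client. */
    void report_event(Client *c, const char ev[32])
    {
        if (c->pending >= MAX_PENDING) {
            c->lost = 1;               /* discard rather than block the server */
            return;
        }
        enqueue_event(c, ev);
        c->pending++;
    }

    /* Called when the transport drains and flow starts again. */
    void flow_resumed(Client *c)
    {
        if (c->lost) {
            enqueue_events_lost(c);    /* "you were too slow" event */
            c->lost = 0;
        }
    }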

	paul

60.52Thought about that, too.STAR::BRANDENBERGIntelligence - just a good party trick?Tue Feb 14 1989 10:138
    
    We argued the possibility of an "events lost" event but the problem is
    with recovering the state change information from the server.  These
    changes are quite complicated and must be retained in some form for a
    client to keep it's environment in order.
    
    					m

60.53So, how about some feedback?STAR::BRANDENBERGIntelligence - just a good party trick?Tue Feb 14 1989 10:362
    

60.54DECW-F-NONMODULAR Program author not aware of DECwindows in 1967.IO::MCCARTNEYJames T. McCartney III - DTN 381-2244 ZK02-2/N24Tue Feb 14 1989 16:1379
RE: .43

I don't suppose that you are suggesting that we call the authors of packages 
like IMSL, SPSS, STRUDL, CHEATAH etc. and inform them that their carefully 
optimized matrix operations take too long. When we tell them that they should 
break up their routines for DECwindows applications (because we're incapable of 
building a robust server that avoids such complications), their reaction will be
the same as mine - laugh and go find a hardware vendor that builds computers, not
toys. If they had wanted a toy they would have called MATTEL.

Seriously, if we can't solve the problem of flow control on the X event queues
and come up with a realistic interpretation of what to do when the transport
becomes clogged, we will have some very unhappy customers. Some of their sources
have been in existence since the middle 60's and the programmers that wrote the
codes may have actually retired! Cracking open all these dusty decks simply 
because DECwindows comes along is not a good reason. (This assumes that one 
callously disregarding the modularity concerns is a viable option. Since 
we've heard over and over from these vendors: "Give us faster hardware, better 
and more interactive interfaces, but don't make us rewrite our codes.", we know
it's not!)

RE: .55

Feedback: Complete agreement with ideas expressed so far. The only thing that
still needs some discussion is what to do about the "lost events" event.

I see the problem with the need to keep the application and the server in sync,
but the hang (or hang-up) solution is definitely not adequate. If an application
was to get a "lost events" event, would it not be safe for the application to 
assume that it should initiate its own recovery mechanism? For instance, unmap
all windows and remap to restore "correct" appearances?

How does discarding input events cause problems? Applications already know how
to tolerate typeahead buffer overrun. Simply dropping mouse or keyboard events
that cannot be buffered should be sufficient. This behaviour is (I believe) 
consistent with existing experience and provides a system that will degrade
with dignity. 

Some special feedback mechanisms need to be provided by the server to ensure
that this overrun condition is quickly detected by the human operator. I believe
there are only three different mechanisms that must be provided: keyboard event
loss, locator motion event loss, and locator button event loss. For keyboard
event loss, simply ringing the bell a la the terminal driver is sufficient. This
same mechanism may also be useful for mouse button event loss. The difficulty
is to find reasonable feedback for the locator motion event loss.

For locator motion, we want to preserve the ability to move to another
application and continue work there, after all, concurrency is one of the good
things that workstations provide. Also the application we might be moving to is
our "hot backup" of the session that has encountered overrun problems. Given that
you accept these design parameters, we obviously cannot just ignore locator 
motion input. We must also track the cursor location on the screen accurately, 
so we can't just refuse to update the cursor. This leaves only two variables,
shape and color. Perhaps we can define a cursor shape or color which can be 
interpreted as "locator events being discarded" Perhaps the locator cursor could
alternate between two different shapes in this (abnormal) case. I don't know 
what the best answer is for this problem - comments?
 
As to what an application should do for lost events, we can easily answer these
questions. If the keyboard events are discarded, it will be as if the user never
struck the key. The application will be unaware of the lost events. For locator
button events, especially timing sensitive double and triple clicks, the lost
events will not be in the data stream but the "lost events" event will be. The
application can take action based on this new event type - usually to ignore 
any partially completed operation. For locator motion, applications already have
to be able to process non-linear motion since the tablet reports position and 
not relative information. 

I admit that accurate locator button tracking is difficult, especially since 
there are timing windows which can cause a lot of pain. For instance consider 
the problem of what happens when you are in a marginal network condition, have
down clicked to make a pull-down selection, started moving the mouse, buffer
overrun occurs, you continue to move the mouse (discarding events), buffer
overrun clears, and you release the button. Unless the application is careful,
this situation can lead to disastrous results. 

Comments?

60.55KONING::KONINGNI1D @FN42eqTue Feb 14 1989 17:348
Clearly the crudest possible response for an application that receives an
"events lost" would be to give up.  That would make it no more crude than
the present approach.  Of course applications can do better; how much better
depends on the application, the skill of the designer, etc.  I'd certainly
go along with the comments in the preceding response.

	paul

60.56PSW::WINALSKIPaul S. WinalskiTue Feb 14 1989 17:569
I like the idea of an "events lost" event.  The author of an application knows
which events the application has elected to receive.  The application is in
the best position to determine whether the loss of events is recoverable or
not--right now, it is the server that decides (and it always decides that an
event loss is unrecoverable).  My educated guess is that the vast majority of
"events lost" events would indeed be recoverable by the application.

--PSW

60.57MYVAX::ANDERSONDave A.Tue Feb 14 1989 18:167
    To make the decision easier for the application, report what type of
    events were lost (keyboard input, mouse motion, mouse button, etc?).
    This requires maintaining only a negligible amount of additional state
    information.
    
    	Dave 

60.58More ideasDECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Tue Feb 14 1989 18:1919
    For "events lost" we should probably allow the client to say something
    about what events he can tolerate losing (a hint, presumably), and
    also, the "events lost" message should probably tell something about
    the nature of the events.  For example, if the client knew that the
    messages lost included mouse motion and expose, it could completely
    repaint itself and Query the mouse position.
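
    A minimal sketch of what that client recovery might look like, assuming
    such an "events lost" event existed; XClearArea() with exposures set
    forces Expose events for the whole window, and XQueryPointer() recovers
    the pointer position (both are existing Xlib calls; only the lost-events
    event itself is hypothetical):

        #include <X11/Xlib.h>

        void recover_from_lost_events(Display *dpy, Window win)
        {
            Window root, child;
            int root_x, root_y, win_x, win_y;
            unsigned int mask;

            /* width/height of 0 mean "the whole window"; True asks the
             * server to generate Expose events so we repaint everything */
            XClearArea(dpy, win, 0, 0, 0, 0, True);

            /* resynchronize our idea of where the mouse is */
            (void)XQueryPointer(dpy, win, &root, &child,
                                &root_x, &root_y, &win_x, &win_y, &mask);
        }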
    
    BTW, there is a conference discussing X protocol change proposals. 
    It's not very active, but maybe it should be.
    
    BTW2, I would like to hear some more discussion of why/why not this is
    a problem on TCP.  If I were to lobby for something like this, I would
    need to make good arguments to Unix people.  (Don't take that to mean
    that good-ole-Burns will get this little protocol thing solved for the
    next version.  This would take more than a little deep thought,
    argument, and lobbying)
    
    Burns

60.59It applies to EVERY transportKONING::KONINGNI1D @FN42eqWed Feb 15 1989 12:2230
The problem is clearly independent of transport.  It applies equally well
to TCP/IP, to the local transport, and so on.  

After all, the problem isn't really the transport at all.  The problem is
application level flow control: the possibility that the server is generating
data (events) faster than the client is accepting them.  As things stand,
the application layer flow control is mapped into transport layer flow control,
since the client stops issuing Transport receive requests, which eventually
blocks Transport send requests at the server.  So the server application
ends up with data that it can't send.

It's usually well understood that distributed applications require flow
control to bound the size of the queues.  There are a couple of possibilities
in the general case:
	1. Design the receiver such that it is guaranteed to run at least
	   as fast as the sender.
	2. Have the sender stop generating new data when the queue is too large.
	3. Discard data when the queue is too large.

X does none of these; it uses the "off with his head" approach.  Given the
properties of X, #2 is not possible (event generation is controlled by the
user at the keyboard/mouse, not by the server alone).  #1 is also not
practical, so that leaves #3.

Note that I didn't mention DECnet anywhere in this discussion; it's all
transport-independent.  (Or you might say that the whole discussion was
in the application layer, not the transport layer.)

	paul

60.60I think "KISS" is the necessary magicBORA::MARTIBeat Marti - ISV Support - MR4-1/H19 - 297-3074Wed Feb 15 1989 13:5018
The problem is not one of transport. It should also not be left up to the
application (or application programmer sprinkling silly event queue flushes
all over the code) to solve the problem. It seems to me, that the only place
where we can think about some reasonable solutions is right where the problem
occurs - at the server.

I don't see anything wrong in stealing the idea from the terminal handlers
which simply ring a bell when the buffer overflows. How about if the server
would simply freeze the pointer, or better yet - change the shape of the pointer
similar to the wristwatch (wait) cursor - within the windows of the application which
is to receive the events which are going to be dropped. In addition, make
sure that any mouse clicks, keyboard inputs or such actions directed to that
application result in some easily identifiable response, maybe something
like ringing the bell.

I don't know how complicated it would be for the server to implement such
functions - but the concept definitely seems simple enough....

60.61LostEventSTAR::BRANDENBERGIntelligence - just a good party trick?Wed Feb 15 1989 14:41119
    
    re .61:  Beautifully stated.  The mapping of application flow control
    onto transport flow control is precisely the problem.  Transport
    implementations have different flow control and so X appears to operate
    differently on different transports.  However, the fault lies with the
    protocol design and server semantics.
    
    (An aside:  it is truly a pleasure to be talking to people other than
    myself for once.  Thanks to all for your participation.)
    
    I'll conclude from the replies that true reliability and not probable
    reliability should be a goal for a DECwindows server.  I concur with
    Paul Koning's conclusion that this implies that the server may drop
    data it attempts to send to a client.  There are two types of data
    which a server may send to a client:  replies and events (errors are
    encoded as events).  What is the "best" way to handle each type?
    
    I'll consider replies first.  A reply is generated in response to some
    client request and so there is some indication that the client will try
    to cooperate with the server.  But what if sending some reply should
    block?  I see three possible responses:
    
    	1.  Drop the connection at the first sign of congestion.  This
    	    certainly guarantees that the server never hangs but it isn't
    	    really reliability.  The protocol will in theory allow replies
    	    to be as large as 16GB.  How well the client is able to sink
    	    the reply data from the server will depend upon what is being
    	    sent, how well the network is operating, whether page faults
    	    are being serviced, the relative speeds of the client and server
    	    machines, how the client is being scheduled, and a host of
    	    other factors.  The client may be trying to read the reply but
    	    the program environment, which it can't control, may not allow
    	    the client to keep up with the server.
    
    	2.  Guaranteed transmission of replies.  This will ensure that any
    	    "best effort" client will receive a reply but now the server
    	    will hang until a reply can be buffered by transport.  I've
    	    already given examples where this time may be unacceptably 
    	    large.
    
    	3.  Best effort attempt.  How long can we allow a server to hang
    	    in an attempt to transmit solicited data to a client?  Decide
    	    this and use it as a timeout on reply transmission.  This
    	    doesn't give 100% reliability *but* we now can quantify the
    	    amount of time a client can take to read a reply if it wants
    	    to retain its connection.  I prefer this choice.
    
    Onward to events.  What is done here will have far-reaching
    implications on the whole decwindows engineering effort.  There are
    some applications whose compute tasks are so large relative to the user
    interface component that they won't mind rebuilding the interface
    should something be lost.  Others will be mostly user interface and
    will want to do as little as possible to recover from a gap in event
    transmission while still being reliable.  The most interesting of this
    latter type, I believe, is the toolkit itself.
    
    With that in mind and a predilection towards the "lost events" event
    and some experience with the intransigency of those who control the
    protocol, I'll consider that possibility.  Review the protocol manual
    and read the x.h and xlib.h files to see what kinds of events are being
    generated and what they cause.  It was suggested that mouse motion
    events are the primary candidates for encountering congestion, but they
    are not the only ones.  Mouse motion can also generate Enter/LeaveNotify
    events for windows up and down the hierarchy.  The offending mouse motions
    could have been part of a button-down sequence, so that not only are
    mouse motions lost but also that all-important button-up event.  Add
    grab/ungrab events to this mess, also.  Then, there are the
    ConfigureNotify and Expose-type events.  These are usually caused by
    another client (the window manager) and failure to respond to these
    will certainly cause ugly holes.  Also, some of these events are
    "counted" events.  I.e., they contain fields which count down the
    number of events which an application may *reliably* expect but which
    may be lost in the new system.  Then, there is the Brave New World of
    events defined by unimagined extensions.  The amount of state that
    needs to be kept and transmitted to the client isn't that small. 
    Basically, an application must be able to enquire as to the exact state
    of the server or at least be able to return it to a known
    configuration.
    
    Without thinking too hard, I'll take a stab at such an event and what
    kind of support it will require.  It is probably interesting to know
    what type of events were lost.  There are 128 possible events (256 if
    one wishes to distinguish "natural" events and those sent with
    XSendEvent).  128 bits requires four longwords.  The event header is
    another longword.  Since this event encompasses a range of activities,
    the application may want to know how long it was out.  If so, reserve
    an additional two longwords for either timestamps or full sequence
    numbers to indicate when events began to be lost and when event
    transmission (always beginning with this event) resumed.  We now
    have seven longwords of an eight-longword event packet used.  The
    remaining longword could be used for modifier and mouse button state at
    the time the event lost event is transmitted.  (NB:  This event is
    connection-wide:  it does not associate with any one resource.)
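
    For concreteness, such a packet might be declared as follows.  This is
    only a sketch of the layout just described; every name here is invented
    and nothing like it exists in x.h or the protocol today:

        typedef struct {
            unsigned char  type;          /* new "LostEvents" event code         */
            unsigned char  detail;        /* unused                              */
            unsigned short sequence;      /* low bits of the sequence number     */
            unsigned long  lostMask[4];   /* 128 bits: one per possible event    */
            unsigned long  firstLost;     /* time/sequence when loss began       */
            unsigned long  resumed;       /* time/sequence when delivery resumed */
            unsigned long  deviceState;   /* modifier and button state right now */
        } XLostEventsPacket;              /* eight longwords, 32 bytes           */

    Whether the middle fields carry timestamps or full sequence numbers is
    exactly the sort of detail that would have to be settled in the protocol.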
    
    What kind of response might a client want to take on receipt of such
    an event?  It could:
    
    	1.  Give up.  Currently, the server does this for it but now a
    	    client will have to do it itself.
    
    	2.  Update everything.  This means repaints, dropping active and
    	    passive grabs, fixing keyboards if they've been changed, etc.
    	    Consider all the resources that may be involved and this may be
    	    a very time-consuming task (as in the case of the toolkit).
    	    A review of the Xlib interface is needed to ensure that we
    	    *can* restore to a known state.
    
    	3.  Intelligent/Selective Update.  The application needs to perform
    	    query operations to see what has changed.  We may need new
    	    protocol requests to query the window layout (as visible not as
    	    defined by the application), GC's, colormaps, and other
    	    resources.  Extensions must provide equivalent functionality.
    	    Additional work is needed in the server up through every
    	    application.
    
    Comments?
    
    						Monty

60.62KONING::KONINGNI1D @FN42eqWed Feb 15 1989 14:5141
.63 says part of what I was in the process of replying to .62...

Re  .62: It's not that simple: there may be multiple clients using the same 
server. (In fact, there just about always are multiple clients.)  The property 
you MUST have is that one client's lack of progress does not block other
clients.  So you can't simply stop accepting keystrokes, or mouse motions,
or whatnot, since some of those inputs may be going to clients that are
operating correctly.  And of course some events are generated by the
actions of other clients: if client A deletes a window, client B may
receive an exposure event.  Clearly it would not be valid to prevent A from
deleting that window.  

If events have to be discarded, and the events are input (keystroke, mouse)
events, then a bell or some similar feedback may be a good idea.  But
with or without that, I believe an "events lost" indication is essential.
If an application wants to take a head-in-the-sand attitude it can simply
ignore such events, though this would tend to result in low quality
applications.  A full repaint (treating events lost as a full exposure event)
is probably the minimum that makes sense.  As .63 points out, restoring
ALL the state may take a lot of work.  There's probably a subset of the
state that could be restored efficiently; something on the order of what
is restored on deiconize.  (Then again, I may simply be showing off my
ignorance of the complexities of X here.)

As for the suggestion to provide some more detail on the events-lost
notification (e.g., classes of events that were lost): that might be useful,
though I suspect most applications wouldn't make use of that.  The fact
that anything at all was lost would be grounds for recovery actions; since
the application isn't supposed to be falling behind and losing things as a
normal operating mode, you wouldn't want to make those recovery actions
all that sophisticated.  There's a rule about "this shouldn't happen" type
of error recovery code -- it says that such code in fact doesn't work in
the field, since it's not tested during field test, certainly not in all
its permutations.  This argues for keeping the lost-event handling code
simple, since in most applications and most configurations it should be
rare.  (Another way to justify that it should be rare is that this event,
when it occurs, disrupts the user interface.  So a human factors argument
says that it must not occur often.)

	paul

60.63Getting back to the problem, if not the subjectPOOL::HALLYBThe smart money was on GoliathWed Feb 15 1989 16:5420
    Designing protocols can be fun, and you guys are doing such a great job
    that I don't need to make any contributions.  But I am worried about what
    appears to be the harder problem -- those long-running applications that
    don't want to change their code.  It seems to me that if you have a
    developer who's going to make use of "lost events", repaint the screen,
    clean up etc., then you probably have a developer who's going to write
    good enough code so that the problem doesn't arise in the first place.
    
    But what do we do about the application that goes into a black hole and
    ignores events for a long time?  Should we provide developers with real
    fast test-1-bit type instructions (if set, call the event queue processor)?
    Or should we provide some way (like ASTs, but not ASTs) to sort of force
    a reluctant application to process events?
    
    It should be OUR desire to make DECwindows such an attractive system that
    ISVs will want to use it.  Forcing X calls into application loops isn't
    the way to advance that cause.
    
      John

60.64PSW::WINALSKIPaul S. WinalskiWed Feb 15 1989 17:2439
RE: .65

I don't see where the case you're referring to is a problem.  Suppose we have
an application that does some DECwindows setup, then calls a subroutine that
does several days' worth of number crunching, ignoring the X event queue the
whole while.  Upon leaving that subroutine, it updates the DECwindows screen.

What happens now is that if the event queue fills up, the server drops the
connection and the application bombs once it leaves the subroutine.

If the "events lost" event is added, the application can find out that this
has happened, if it wishes to, and can take corrective action.  If the
programmer ignores the "events lost" event, then it's possible that there might
be misbehavior.  So what?  At least with "events lost" events, this sort of
application can recover from lost events.  With the current protocol design, it
cannot.  Note also that the check for events doesn't have to be in the number-
crunching subroutine this way--we haven't forced the programmer to turn his
application inside out.
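
As a rough sketch of that pattern (the LostEvents code below is the
hypothetical event being proposed here, not something Xlib provides, and the
helper routines are invented), the recovery check lives entirely outside the
number-crunching code:

    #include <X11/Xlib.h>

    #define LostEvents 35                   /* invented code for this sketch */

    extern void crunch_for_days(void);      /* knows nothing about X         */
    extern void repaint_everything(Display *dpy);

    void run(Display *dpy)
    {
        XEvent ev;

        crunch_for_days();                  /* ignores the event queue       */

        while (XPending(dpy) > 0) {         /* drain whatever piled up       */
            XNextEvent(dpy, &ev);
            if (ev.type == LostEvents)
                repaint_everything(dpy);    /* worst-case recovery           */
            /* handle Expose and friends as usual ... */
        }
    }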


Regarding getting this change accepted by the X Consortium--the "events lost"
event seems to be in keeping with the general X philosophy of pushing work back
on the application.  Just as an application must decide if exposure events are
significant and if so, process them, "events lost" events put the decision on
whether event buffer overflow is significant in the hands of the application,
not the server.  If the application chooses not to handle such events, it can
either ignore them or abort.  One would expect the DECwindows Toolkit to recover
from such events, of course.

A properly-designed server that is supposed to handle more than one client
simultaneously should never let itself get into the situation where a flow
control problem with one client blocks the entire server.  This can be done
without the server imposing any kind of timeouts on client connections.  Link
breakage detection and timing out of connections should be the job of the
underlying virtual circuit transport on which the server/client communication
is based--it should not be done by the server itself.

--PSW

60.65I think I can answer the question about why TCP is less affectedRIGGER::PETTENGILLmulpWed Feb 15 1989 20:3734
TCP (it doesn't need to be IP, but usually is) is byte stream oriented.  This
means that the application needs to provide its own record framing (which isn't
usually much of an issue) and `interrupt messages' are sort of a kludge (if
you don't have records, how do you know where to insert data that is supposed
to skip to the front of all other data without hopelessly confusing the
application).  However, X fits TCP well (for hopefully obvious reasons) since
it has its own `record structure' and it doesn't use interrupt messages.

So, how does this help ?

Being a byte stream protocol, TCP is geared to handling a byte stream.  Its
flow control unit is bytes, not records, and it gets to decide for the most
part when to transmit data based on either a timer or some fraction of its
buffer being filled.  When it transmits a datagram (usually IP but not required)
the datagram includes the starting byte in the current window and the number
of bytes in this segment.  On the receiving end, the message must ALWAYS be
processed, even if (some of) the data has been received (and passed to the user
and acknowledged) before.  This means that when a TCP connection has a byte
quota of 6000, the connection won't stall until all 6000 bytes of the buffer
are filled.  It is possible to write 1 byte at a time to the TCP socket and
without any ack from the other end, send 6000 datagrams ranging in size from
1-6000 bytes long (IP datagrams can be as large as 8kb).  TCP doesn't need to
keep around 6000 copies of the datagrams in actual or virtual format to operate.

As I understand the VMS DECnet implementation, the pipeline quota in bytes
is simply used to compute the number of outstanding datagrams that will be
used.  Something along the lines of 10000/576 -> 18 datagrams.  If a write
is done for 1 byte, then one datagram is used, and it is possible for 18 bytes
of data to consume the 10000 bytes of quota.  I'm being extreme, but in the
case of mouse events, I expect that no matter how fast a user is, each click
results in a very small amount of data (25 bytes) being written which will
be sent in a separate DECnet datagram.  (As I said, I don't understand this
well, or maybe not at all....)

60.66Oops, I missed the obvious on TCPRIGGER::PETTENGILLmulpWed Feb 15 1989 22:3236
I just did a little checking of what messages actually get sent and realized
that I missed the obvious about TCP.

The receiving end of TCP doesn't need to worry about keeping track of record
boundaries, so it can simply stuff everything in one buffer.  In the case
of the VMS Connection, it normally has a receive and transmit buffer size
of 4096.  After establishing a connection (DECW$CLOCK) and then making sure
that the client would not process any events (^Y) I generated events and
watched how the server system sent about 60 datagrams which filled the 4096
byte buffer on the client system (average of about 70 data bytes each).  Then
I watched as the 4096 byte buffer on the server filled.  About 30 seconds after
the server buffer filled, the server killed the connection to the client.
Each datagram was ack'd by the client system.

Until both buffers were filled, the server continued to function normally.

In contrast, when I did the same using DECnet/VAX, the server sent about 18
datagrams (each ack'd) averaging about the same size as the TCP datagrams
and then the server stalled.  About 30 seconds later the server terminated the
connection.  The DECnet system had a pipeline quota of 10000.
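
The arithmetic behind the difference, using roughly the figures observed
above (a toy calculation only, not part of any transport code):

    #include <stdio.h>

    int main(void)
    {
        int event_bytes = 70;       /* approximate data bytes per event here */
        int tcp_buffer  = 4096;     /* TCP charges its quota in bytes        */
        int quota = 10000, datagram = 576;  /* DECnet charges per datagram   */

        printf("events the TCP side buffered:    about %d\n",
               tcp_buffer / event_bytes);        /* ~58, close to the 60     */
        printf("events the DECnet side buffered: about %d\n",
               quota / datagram);                /* ~17, close to the 18     */
        return 0;
    }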

This suggests a partial solution; since DECnet won't make efficient use of
its buffers (i.e., using 1500 bytes to store 60-70 bytes of data), the DECnet
transport module needs to do it.  On the client side, it could do its input
I/O with ASTs and read into a buffer from which it passes data to the Xlib
code.  As long as it is able to get ASTs, it will be able to keep the server
happy until its buffer fills.  Similarly, on the server side, it needs to make
sure that it never stalls and when DECnet won't accept any more data, the
transport module needs to move the data into its own buffer.
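
In outline, the client side of such an interim hack might look something like
this (a schematic only: the names, the buffer size, and the overflow policy
are all invented, and the real thing would be driven by VMS ASTs and QIOs
rather than plain function calls):

    #include <stddef.h>

    #define STAGING_SIZE (64 * 1024)

    static unsigned char staging[STAGING_SIZE];
    static size_t head, tail;                  /* tail = oldest unread byte  */

    static size_t used(void) { return head - tail; }

    /* Called from the receive-completion AST: copy the data aside at once,
     * so the transport can acknowledge it and the server never stalls.     */
    int on_receive(const unsigned char *data, size_t len)
    {
        size_t i;
        if (used() + len > STAGING_SIZE)
            return -1;                         /* full: now a policy decision */
        for (i = 0; i < len; i++)
            staging[(head + i) % STAGING_SIZE] = data[i];
        head += len;
        return 0;
    }

    /* Called from the Xlib read path at the application's leisure. */
    size_t xlib_read(unsigned char *out, size_t want)
    {
        size_t i, n = used() < want ? used() : want;
        for (i = 0; i < n; i++)
            out[i] = staging[(tail + i) % STAGING_SIZE];
        tail += n;
        return n;
    }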

This is certainly a hack, but I believe that it would only need to be an
interim hack until the Phase V interface becomes available; I'm guessing,
but I suspect that it may help in this area.  If not, then this is the kind
of info Tom Harding et al were looking for a few months back when they were
asking what advantage a stream interface offered and should they support one.

60.67Some events are more equal than othersDSSDEV::TANNENBAUMWed Feb 15 1989 22:4529
    Even with the proposed changes, DECwindows would still be missing an
    important feature available in the terminal world.  Applications
    aren't always well behaved.  If my application runs away, I want (need)
    some way to get control of it without necessarily blowing away the
    process.  I may have invested a lot of time and effort in my current
    application state.  I want to save it if at all possible. 
    
    Even if the a "lost events" event is added, TPU will still need to poll
    the input queue periodically to check for ^C's.  It's too easy to put a
    TPU-based application into an infinite loop.  For example, type
    
    	TPU a := 0; LOOP a := a + 1; ENDLOOP
    
    at EVE's command prompt and watch TPU count to infinity.
    
    Our first attempt at dealing with this resulted in our asking XLIB
    for an AST for any keyboard character.  Performance was abysmal.
    Users type *lots* of keys at a text editor.  Currently we have an AST
    that checks the input queue once a second (XLIB can be called at
    AST level) and sets a flag if there are any events pending.  At
    the top of our interpreter loop, we check the flag and call a routine
    to dispatch any pending events if it is set.  (The tool kit can
    only be called from non-AST level)
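
    Roughly, the shape of that arrangement is the following.  This is only
    a sketch, with a POSIX alarm standing in for the VMS timer AST; none of
    these names are TPU's actual routines:

        #include <signal.h>
        #include <unistd.h>
        #include <X11/Xlib.h>

        static volatile sig_atomic_t check_events;

        static void tick(int sig)           /* stand-in for the timer AST     */
        {
            (void)sig;
            check_events = 1;               /* cheap: just note it's time     */
            alarm(1);                       /* re-arm for a second from now   */
        }

        void interpreter_loop(Display *dpy)
        {
            signal(SIGALRM, tick);
            alarm(1);

            for (;;) {
                /* ... execute one interpreter instruction ... */

                if (check_events) {         /* cheap test at the top of loop  */
                    check_events = 0;
                    while (XPending(dpy)) { /* dispatch at non-AST level only */
                        XEvent ev;
                        XNextEvent(dpy, &ev);
                        /* dispatch ev, look for ^C, etc. */
                    }
                }
            }
        }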
    
    Imagine trying to debug an application that goes into an infinite loop
    without being able to type ^Y DEBUG... 
    
    	- Barry

60.68Events lost event rejected by MIT in the pastSTAR::BMATTHEWSThu Feb 16 1989 05:245
An events lost event was proposed to the X11 developers during X11 development
and it was rejected so I am not sure how likely it is to get this into the
protocol.
						Bill

60.69X12R1STAR::BRANDENBERGIntelligence - just a good party trick?Thu Feb 16 1989 09:4028
    
    Getting such a change accepted by the Consortium is going to be a huge
    task.  This change is more than just adding a new event packet.  Here
    are some of the implications:
    
    1.  Throw out the event section of the protocol manual (which is both a
    protocol *and* a server specification).  In the future, event delivery
    becomes unreliable and so there will be no guarantee the count fields
    will be honored or bracketed state changes be undone (such as
    buttonRelease after buttonPress, ungrab after grab, etc.).  An
    "eventsLost" event will require an application to either give up or
    recover server state, much more involved than exposure handling.
    
    2.  Toolkits, Widgets, and any other programming or environment tool
    will have to handle this event gracefully if an application using it is
    to be reliable.  And not just DEC's toolkit:  Athena's, HP's, and every
    Tom, Dick, and Harry, Inc. that makes an X Window System.
    
    3.  Rewrite event handling in applications.  All applications. 
    Everybody's applications.
    
    Item 1 is a significant enough change to call this "X12."  The
    Consortium will have to be pushed *hard* (or off a cliff) to get this
    change accepted.  After all, how many programmers care about reliable
    systems?
    
    						m

60.70STAR::BRANDENBERGIntelligence - just a good party trick?Thu Feb 16 1989 09:4413
    
    Re .68 "partial solution":
    
    This is part of what common transport attempts to do.  There is
    obviously a tradeoff between ability to achieve a stream-like appearance
    and the CPU cost of performing the data copies.  I chose a point that
    leaned too far towards performance and not enough towards streams.  I
    currently have some transports running that perform more copying on
    writes and this has improved reliability.  The performance impact is
    not yet known.
    
    						m

60.71KONING::KONINGNI1D @FN42eqFri Feb 17 1989 11:1932
Stream transports can make the problem appear less quickly, but clearly can't
eliminate the problem.

Re .71: the impression I get is one I keep getting over and over from certain
areas: that high quality is a non-goal.  "Good enough for programmers" is
all that is considered necessary.  UGH.  I also don't think the arguments
hold water.  "Event delivery becomes unreliable."  Sure -- but it already 
IS unreliable.  In all the cases where it is reliable currently, it will
continue to be reliable.  In all cases where it is currently so unreliable
that it blows the application completely out of the water, it continues
to be unreliable.  The only difference is that the error is no longer a
fatal error but one that applications can, if they wish, recover from.
Currently, the error is fatal and applications are not given the option
to recover no matter how much they may want to.

There is no compatibility problem.  Any application that ignores the
event will not be any worse off than it is now.  Depending on what it
would have done had it not chosen to ignore the event, it may be very
much better off.  Any application whose developers take the trouble to
do some work to process the event is improved in the process.

In other words, you can't lose.  It is an absolute improvement for every
application.

Re .65: what to do about applications that don't want to redesign their
code to guarantee that events are processed quickly enough to avoid event
loss: that's where a proper multithread support will help.  Put the
application in one thread, the event handler in another, and you're done.
(Well, close, anyway...)

	paul

60.72You can't poll for events often enough, everPRNSYS::LOMICKAJJeff LomickaFri Feb 17 1989 13:2615
After what happened to me yesterday, I am convinced that the current X
transports cannot be made reliable on VMS unless you check for X events
between EVERY INSTRUCTION, perhaps more often than that.

You see, yesterday a machine running a client of my workstation went
into a long cluster transition.  Need I say more?  I will anyway.

I beat on the keyboard and mouse a bit, and sure enough, my entire
STAND-ALONE workstation was hung until the server decided to trash the
offending client; then I could proceed.

My gut reaction to this entire discussion is "how could anybody be so
ignorant as to ignore the flow control problem here".


60.73STAR::BRANDENBERGIntelligence - just a good party trick?Fri Feb 17 1989 13:4518
    re .74:
    
>My gut reaction to this entire discussion is "how could anybody be so
>ignorant as to ignore the flow control problem here".

    Say one of the following in a whining, geekish voice:
    
    1. "It's too haaaaaard to solve."
    
    2. "*I* don't have any problems;  there must be something wrong with
    the user or programmer!"
    
    3. "What flow control problem?"
    
    4. "Zzzzzzzzzzzz.  Snort."
    
    					monty

60.74VWSENG::KLEINSORGEToys 'R' UsFri Feb 17 1989 13:586
    
    As one of the x11-high-and-mighty started a mail message to me two
    years ago:  "Any competent programmer..."
    
    

60.75it can be done compatiblyPSW::WINALSKIPaul S. WinalskiSat Feb 18 1989 15:1951
Sorry, but the line of reasoning in .71 is faulty.  Taking the points in order:

1) Addition of an events lost packet does not mean that event delivery becomes
   unreliable.  As Paul Koning pointed out, event delivery already IS
   unreliable.  Events lost continues to be an error condition, as it is today.
   The only difference is that an events lost packet lets the client decide
   if the condition is severe enough to warrant aborting the connection.  Today
   the server decides unilaterally that the condition is always fatal.
   Applications that cannot deal with the condition for any of the reasons
   that you cite are perfectly free to handle an events lost condition by
   aborting the connection.  The difference here is that those clients that CAN
   handle the condition are able to do so.

2) Toolkits *should* handle the condition gracefully to be of maximum service
   to the user.  Those that choose not to handle the condition gracefully will
   offer service that is exactly like it is today.

3) The change can be done compatibly, with no rewrite required in existing
   applications.  The way to do this is to make enabling events lost
   notification optional.  This could be done in either of two ways:

   o add a routine, call it XSetFlowControl().  This would be analogous to
     XSynchronize().  When you enable flow control, the server will try to send
     events lost packets instead of aborting the link when buffering capability
     is exhausted.  If flow control is not enabled explicitly, you get the
     current behavior.

   o enabling delivery of events lost notifications via XSelectEvent causes
     the server to send such events instead of aborting the link when buffer
     space is exhausted.

   Either of these methods would be completely upward compatible with current
   behavior since an application must explicitly ask for events lost packets
   to be sent, otherwise you get today's behavior.
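
From the client's point of view, opting in might look roughly like this
(XSetFlowControl and the LostEvents event code are both names invented for
this proposal; neither exists in Xlib or the protocol today):

    #include <X11/Xlib.h>

    #define LostEvents 35                           /* invented event code  */
    extern void XSetFlowControl(Display *dpy, int onoff);  /* proposed call */
    extern void recover_from_gap(Display *dpy);

    void event_loop(Display *dpy)
    {
        XEvent ev;

        XSetFlowControl(dpy, 1);   /* "send LostEvents rather than hang up" */

        for (;;) {
            XNextEvent(dpy, &ev);
            if (ev.type == LostEvents) {
                recover_from_gap(dpy);     /* repaint, re-query state, etc. */
                continue;
            }
            /* normal dispatching ... */
        }
    }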

It should still be a guiding principle of the design of X transports that they
do whatever possible to avoid getting into the situation where either a packet
must be dropped on the floor or the server/client connection aborted.  However,
it is a fundamental fact of life in protocols of this sort that data loss due
to buffering capacity being exceeded can and will occur.  The trick is to find a
reasonable way to handle the situation.  X's current method--terminate the
client link--is Draconian but effective.  Allowing an application to receive
an events lost event, if the application so chooses, puts the decision whether
to abort the link in the hands of the client rather than the server.  This
seems to me in perfect keeping with the general X philosophy of not having
the server make policy decisions that are better made elsewhere.  Who knows
better than the client itself whether the situation is recoverable at the
client end?

--PSW

60.76devil's advocacySSAG::GARDNERSun Feb 19 1989 19:5623
    I know absolutely nothing about the X protocol per se, so maybe this is
    off the wall.  But it's an obvious enough question that it needs to be
    asked.
    
    Why can't termination of the connection be treated as an "events lost"
    notification?  When a connection is broken, why doesn't the application
    and/or toolkit try to re-create the connection and, if it succeeds, do
    whatever it was going to do to recover from "events lost".  When events
    are lost, the state of the application's windows, etc. are essentially
    indeterminate; it should probably re-construct or refresh them from
    scratch (a previous response suggested doing this regardless of any
    shortcuts that might be possible).  Why can't it just re-create them on
    a new connection?
    
    If such an approach is plausible, it has the advantage of being totally
    compatible with the current X protocol.  Plus it avoids a potential
    pitfall of adding "events lost" notifications.  Suppose an application
    crashes somehow without explicitly tearing down the connection.  My
    impression is that there's no convenient way for me, from the server,
    to abort the connection and recover the server resources that are
    devoted to it.  The server might merrily preserve the connection
    forever, discarding events as necessary.  

60.77PSW::WINALSKIPaul S. WinalskiMon Feb 20 1989 01:176
You can't just reestablish the connection because all of the windows, graphics
context, etc. associated with the connection are destroyed by the server when
the connection goes away.

--PSW

60.78Why not make it a Extension ?LESZEK::NEIDECKERDont force it,get a bigger hammerMon Feb 20 1989 01:5711
Re. 70-71:

	If it is so hard to get this additional event accepted by the
	consortium, why don't we make it into an extension package that
	DECwindows servers support ?  If it turns out to be the solution, we
	have a bonus; if a server doesn't know the extension, our clients
	(the Toolkit, etc.) fall back to whatever they do today (e.g. nothing).
	Should be little registration hassle ?
			
					Burkhard Neidecker-Lutz, Project NESTOR

60.79SSAG::GARDNERMon Feb 20 1989 12:4711
> You can't just reestablish the connection because all of the windows, graphics
> context, etc. associated with the connection is destroyed by the server when
> the connection goes away.
    
    But doesn't the toolkit/application have a representation of that
    information in the various toolkit data structures?  Since unknown
    events have been lost, you have to walk these data structures anyway to
    restore the windows, graphics context, etc. on the screen.  To this
    (possibly naive) observer, it doesn't seem significantly harder to
    re-create the objects first.

60.80PSW::WINALSKIPaul S. WinalskiMon Feb 20 1989 15:4410
If you lose events, the server still has windows and graphics context for the
application.  It's just that they may not be quite in the state that the
application thinks they are in.  If you break the connection, the server throws
away the windows and graphics context completely.  If the connection goes away
and then the application establishes a new one and restores things, the user
will see the application's windows actually disappear from the screen and then
come back again.

--PSW

60.81If you can't solve the problem, avoid it ?STAR::MANNMon Feb 20 1989 20:0217
	If the server detects that the user is trying to select a
	stalled session, why not just display a skull and crossbones ?

	This method:

	1 - Gives the user appropriate feedback
	2 - Prevents the server from entering a (temporarily) deadlocked state
	3 - Prevents the application from being needlessly aborted
	4 - Does not involve any X protocol changes

	Ever notice the terminal driver lock your keyboard ? Guess
	what it would have to do with that character if it let you 
	type it ?

	Or is the X server code unmodifiable in this manner ?
								Bruce

60.82PSW::WINALSKIPaul S. WinalskiMon Feb 20 1989 21:378
It's more complicated than selecting a stalled session.  Suppose you push a
window.  That could cause a string of exposure events, some of which can be
sent and others of which can't because the buffer space was exhausted.  It's
hard to tell before the operation occurs that it could cause somebody to
overflow buffer space.

--PSW

60.83complicated, I believeSTAR::BRANDENBERGIntelligence - just a good party trick?Tue Feb 21 1989 12:4052
    
    Sorry, but the line of reasoning in .77 is faulty.  Consider the
    following extract from page 76 of the X11, Release 2 Protocol Document:
    
    	For a given "action" causing exposure events, the set of events
    	for a given window are guaranteed to be reported contiguously.  
    	If count is zero, then no more Expose events for this window
    	follow.  If count is non-zero, then at least that many more 
    	Expose events for this window follow (and possibly more).
    
    Implications of adding a "LostEvents" event:
    
    1.  Protocol and semantics change.  Count is more like a "hint" than
    a reliable value.
    
    2.  Application programs change.  Certain coding constructs are no
    longer acceptable.  For example, an event handling routine may switch
    on the eventType in an event packet to execute code such as:
    
    switch (ev.type) {
    case Expose:
    	/*
    	 * dump the extra expose events promised by the count field
    	 */
    	for (i = 0; i < ev.xexpose.count; i++) XNextEvent(dpy, &dummyEv);
    	/*
    	 * do generic exposure handling
    	 */
    	do_exposure();
    	break;
    }
    
    This code is correct under the current protocol and server semantics
    but is incorrect after the suggested protocol change is made (a version
    reworked to tolerate the change follows this list).
    
    3.  Scope of "LostEvents" Event burdens *all* applications.  For
    reasons of implementability in the server, this event should probably
    be associated with a connection and not one-or-more-per-resource as are
    many events.  (In this respect, it is much like the unmaskable events.)
    So, if an application were to turn on this event, any toolkit it used
    would also see the event.  Or, if any toolkit wanted to receive this
    event and turned it on, it would be turning it on for the application.
    Is everyone prepared to handle this event even if only to ignore it?
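
    For contrast, here is one hypothetical reworking of the fragment in
    point 2 that would survive the change:  the count field is treated only
    as a hint, and the invented LostEvents code is handled explicitly.
    None of this is existing Xlib or server behavior.

        #include <X11/Xlib.h>

        #define LostEvents 35                      /* invented for the sketch   */

        extern void do_exposure(void);

        void handle_event(Display *dpy, XEvent ev)
        {
            switch (ev.type) {
            case Expose:
                /*
                 * Skip redundant Expose events, but stop at the first sign
                 * of a gap -- the promised count may never arrive.
                 */
                while (ev.xexpose.count > 0 && XPending(dpy)) {
                    XEvent next;
                    XPeekEvent(dpy, &next);
                    if (next.type != Expose)
                        break;
                    XNextEvent(dpy, &ev);
                }
                do_exposure();
                break;
            case LostEvents:
                do_exposure();                     /* assume everything changed */
                break;
            }
        }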
    
    The addition of a "LostEvents" event is necessary.  An xlib request
    similar to Paul's XSetFlowControl (or a point revision to the protocol
    version sensed at connection setup time) may be desirable.  But these
    two alone are not sufficient to make X reliable.  Event processing and
    generation *does* change. And, do we have the functionality that allows
    an application to recover relatively conveniently from such an event?
    
    					monty

60.84Some additional thoughts...IO::MCCARTNEYJames T. McCartney III - DTN 381-2244 ZK02-2/N24Tue Feb 21 1989 14:1016
RE: .83

Whether the terminal driver locks the keyboard or simply throws away any
character for which it does not have buffer space, the results are identical.
In either case the datastream being sent to the host is interrupted and data
is lost. The keyboard being locked does not physically prevent data from being
transmitted on the line, nor does it stop an operator from continuing to strike
keys. Although I agree with the behaviour of the terminal driver, it is not an
adequate model for solving the problem inherent in X. The terminal driver does 
not provide the needed "data lost" indication.

I like the idea of a skull and crossbones, especially if it was imaged inside 
of a solid black locator cursor.

James

60.85Ok, how about this ?STAR::MANNTue Feb 21 1989 20:5120
>It's more complicated than selecting a stalled session.  Suppose you push a
>window.  That could cause a string of exposure events, some of which can be
>sent and others of which can't because the buffer space was exhausted.  It's
>hard to tell before the operation occurs that it could cause somebody to
>overflow buffer space.

	When a session stalls, immediately shrink it to an icon (automatically) 
and queue the event/message to it which caused the stall in an "overflow" 
buffer(s) (and display a skull and crossbones). Now it cannot become the 
recipient of exposure events, can it ? If the session unjams, send the 
overflow buffer(s) and resume normal operation.

	"buffer space exhausted" is a policy, right ? The workstation has not
run out of memory ! Transport is simply advising it that it is no longer
sensible to send messages because they cannot be delivered just now. Just
reflect this condition back to the user in a way that prevents the user from
continuing that session in a non-discretionary manner (make the user use his
memory).
								Bruce

60.86can the server do that?AITG::DERAMODaniel V. {AITG,ZFC}:: D&#039;EramoTue Feb 21 1989 23:199
     re .87
     
>>     When a session stalls, immediately shrink it to an icon (automatically)
     
     Isn't it the window manager (i.e., another client) and not the
     server that knows about things like icons and where they go?
     
     Dan

60.87Some sleight-of-hand, a little smoke and mirrors..POOL::HALLYBThe smart money was on GoliathWed Feb 22 1989 09:577
>     Isn't it the window manager (i.e., another client) and not the
>     server that knows about things like icons and where they go?
    
    Maybe the server could send a message to the wm saying "put this guy in
    the drunk tank".  Come to think of it, the icon box icon looks a bit
    like a jailhouse window...

60.88VWSENG::KLEINSORGEToys &#039;R&#039; UsWed Feb 22 1989 10:0964
    Let's look at what the terminal driver (VMS), terminal (VT200) and
    a random application do...
    
    	Terminal data comes in and the terminal class driver puts
    	the character data into a typeahead buffer and completes
    	a outstanding read if the conditions of the read are met.
    	If the typeahead buffer contents reaches a certain degree
    	of 'full', the class driver tells the terminal to shut-up
    	(XOFF).  It will still accept data until the typeahead is
    	full at which point it drops any further data and returns
    	a DATAOVERRUN when the typeahead is finally read.
    
    	When the terminal gets the XOFF, it *also* (at least on VT200's
    	and VT300's) has some amount of buffering and will buffer
    	transmit data until *its* buffer is full at which time it
    	sets the WAIT LED and drops transmit data.
    
    	The application is oblivious to all this.  Periodically it
    	reads the typeahead buffer and only knows about any of this
    	when and if it gets a DATAOVERRUN message.
    
    Extend this to the X11 world:
    
    	First, this implies that the client software which manages the
    	connection gets asynchronous notification of an event and moves
    	the packet from the transport to the clients event queue (i.e.
    	this operation is not a side effect of a processing loop in
    	user mode!).  This software sends a message to the server when
    	the clients event queue reaches some degree of 'full' telling
    	the server to shut-up (XOFF).  It of course still accepts new
    	events until it runs out of free packets at which time it
    	starts dropping events and begins to build a event-lost client
    	structure.
    
    	The server, sends event packets off as long as it hasn't been
    	told to 'shut-up'.  By a combination of local buffering on a
    	per-connection basis by the server and the 'slop' in the client
    	side event queue after a XOFF, the event "counts" should always
    	remain valid, that is, even if the server is XOFFED after having
    	sent the packet with the "count", the combination of the client
    	buffering to soak up packets already "in the pipe" and the server
    	buffering of unsent packets would deliver all the packets promised
    	(so it doesn't need to change the meaning of the count to a
    	'hint').  If and when the server runs out of buffering, it starts
    	building a server-side event-lost structure that will be used
    	to build an event-lost event when the client starts taking input
    	again.  This implies that the server is smart enough not to
    	send a counted event during an XOFF if there is not enough local
    	buffering available for all the event packets.
    
	When the server reaches the point that it is discarding events,
    	it *can* give some visual 'hints' that it is stalled, including
    	ringing the BELL for KB input, turning off autorepeat on the
    	KB, setting the WAIT LED on the KB, and changing the cursor shape.
    	All of these can be restored to a proper state once
    	the error condition is corrected.
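
    	The client-side half of this amounts to watermark throttling of
    	the event queue.  Schematically (every name, threshold, and the
    	XOFF/XON messages themselves are invented for illustration), with
    	the server needing the complementary logic plus the per-connection
    	buffering described above:

        #define QUEUE_LIMIT 256
        #define HIGH_WATER  192         /* tell the server to shut up     */
        #define LOW_WATER    64         /* tell it to resume              */

        static int queued;              /* events sitting in the queue    */
        static int xoff_sent;

        extern void send_xoff_to_server(void);
        extern void send_xon_to_server(void);
        extern int  enqueue_event(const void *packet);  /* 0 ok, -1 full  */
        extern int  dequeue_event(void *packet);        /* 1 got, 0 empty */

        /* Runs asynchronously when transport delivers an event packet.   */
        void on_event_arrival(const void *packet)
        {
            if (queued >= QUEUE_LIMIT || enqueue_event(packet) != 0) {
                /* out of packets: start building the event-lost record   */
                return;
            }
            queued++;
            if (queued >= HIGH_WATER && !xoff_sent) {
                send_xoff_to_server();
                xoff_sent = 1;
            }
        }

        /* Runs in the application's normal event-reading path.           */
        int next_event(void *packet)
        {
            if (!dequeue_event(packet))
                return 0;
            queued--;
            if (xoff_sent && queued <= LOW_WATER) {
                send_xon_to_server();
                xoff_sent = 0;
            }
            return 1;
        }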

    	Now, all of this is probably meaningless, because I've got this
    	nasty feeling that the client-side input queue is built as part
    	of the polling loop, and otherwise the data just stacks up
    	uncollected in the transport buffers (DECnet, whatever).
    
    

60.89I ruminate?VINO::WITHROWRobert WithrowWed Feb 22 1989 13:3157
I'm not an X windows maven, so don't yell at me.  I'd like to categorize the
later portions of this note (which seems to have migrated somewhat from
the original topic).  I will only be speaking in ``broad conceptual'' terms.

It seems to me that there are two concerns:  1) What should happen when a client
is sourcing events faster than the server is sinking them, and 2) what should
happen when the server is sourcing events faster than a client is sinking
them.

In case (1) it seems that most participants think it is fine if the
client is forced into quiescence (forced to nap) until the server has
caught up with it.  Seems reasonable to me.  Nothing is lost and other
clients are not affected.

In case (2), one can not force the server to nap because that will affect
other clients.  A previous reply suggested a skull and crossbones cursor,
etc.  Others objected that information is getting lost.  Comparisons with
terminal handlers were made, etc...  Can we take this in parts?

a) Does everyone agree that (2) is a ``policy'' issue?  I mean, it's nice
to claim that a client should always be able to sink events at least
as fast as the server can ever source them, but I don't think that is
possible since one can never have infinite buffering.  Lacking infinite
buffering one must have flow control, and that seems to me to mean
``flow control policy''.

b) It seems that flow can be controlled in several places and in many
different ways.  Suggestions have been: Implement flow control in
X protocol, possibly by throwing excess events into the can and telling
the client we did that; Implement flow control in a lower layer, and if
(2) happens take drastic action (which seems to be what is done now);
Implement flow control in the server by refusing to send events until
the client catches up.  Are there more?

Since (I hope we agree) this is a policy issue, I guess I would like to
see it resolved in the server, since I feel that it is rude of the server
to bombard the client with events, and, in the interrest of robustness,
I would prefer to assume that the server is smarter than the typical
client (and thus should be able to restrain itself).  Also, it is a
single point solution that does not require every
single client to worry about what to do with a rude server (servant?).

To that end, it seems reasonable to handle (2) this way:  When the
server discovers that it is sourcing events faster than a client is
sinking them it should: a) Ignore all user input into the window(s) associated
with the client (Perhaps it should beep for keypad input, and should
turn off the mouse pointer when it enters the window), and b) not send
exposure events to the client.  If the server does save-unders it would
be free to repaint exposed areas itself from its backing store, otherwise
it should just leave the ugly holes in the window.

Later, when the client catches up, the server should again allow user input
in the windows, and (if it wanted to send any exposure events but couldn't)
send an exposure event for the entire window.

Like I said, Dont yell at me!!!!!   ;-)

60.90I rusticateSTAR::BRANDENBERGIntelligence - just a good party trick?Wed Feb 22 1989 15:13177
    
    Re .91:  I'll talk some more...
    
    I agree that this is *at least* a matter of policy but may also be a
    matter of protocol and server specification.  (My earlier reference to
    the interpretation and generation of expose events is sufficient to
    make the latter true.)
    
    I also accept the policy on case (1) where the client can't send to the
    server.
    
    Now, as for case (2), you've summarized the possibilites as being:
    
    a.  Drop events and generate a "LostEvents" event when possible.
    
    b.  Drop the connection.
    
    c.  Drop events but don't give any indication to the application
    	(there may be user/device feedback, however).
    
    If reliability, at least as I understand the term, is a goal, then b.
    is clearly unacceptable.  If either a. or c. is chosen, the protocol
    and server specifications still must change (see previous discussion on
    the interpretation and generation of expose events).  Furthermore, I
    believe that c. is *extremely* unfriendly to the application.  It
    doesn't find out that it has lost events until it either receives
    information on the server state that is inconsistent with its model of
    the server state or the user tells the application, via a "fix-up"
    request, that it is confused.  Consider a window manager in
    window-resize mode:  it has grabbed the server, it's receiving mouse
    motion events to perform stretchy-box operations but the server drops
    some number of mouse motion events *and* the upclick of the mouse.  At
    what point does the window manager find out that information was lost
    so it can ungrab the server and return to a safe state?
    
    Now, I'll jump into a policy definition for all data sent from the
    server to the client.  Keep in mind the following things:
    
    	1.  Any client can send any event to another client with
    	    XSendEvent().
    
    	2.  Clients interact with other clients as a natural part of
    	    operation.  One client's requests may result in any number of
    	    events being generated for any or all of the other clients.
    
        3.  Extensions.  Always remember extensions.  We don't know
    	    what they'll look like or how they'll define their own
    	    events, if they do at all, and use whatever policy we
    	    establish.
    
    	4.  Certain events/state transitions currently guarantee that
    	    certain other events will be sent at a later time.  If event
    	    delivery becomes unreliable without disconnecting a link,
    	    these "guaranteed" events may not be received by the client.
    
    
    The following "#define"'s are taken from x.h.  They represent the
    *currently* defined event codes.  I've also included replies (type
    code '1') and errors (type code '0').
    
    #define X_Reply		1
    
    Reply to request issued by client.  Unlike events and errors which are
    always 32 bytes, this may range in size from 32 to 2^34 bytes.  There
    is some indication that the client will try to read data but should the
    server wait unconditionally for a slow or hung or thrashing or
    malicious client?  I suggest a configurable parameter specifying a
    timeout for reply operations, probably on the order of 5-10 seconds. 
    If the client doesn't respond, disconnect.
    
    #define X_Error		0
    
    Some request generated an error.  Errors generated by asynchronous
    requests are asynchronous, those generated by synchronous requests
    (i.e. those expecting replies) are synchronous and the event is sent in
    place of the reply.  If the latter case, errors should be treated as
    replies and the timeout should be used.  If the former case, they could
    be treated as either replies or as events (they may be dropped).
    
#define KeyPress		2
#define KeyRelease		3
#define ButtonPress		4
#define ButtonRelease		5
#define MotionNotify		6
    
    Indicates that a keyboard key was pressed or released, a mouse button
    was pressed or released or that an "interesting" motion of the mouse
    occurred.  With unreliable delivery, release and press events may not
    match up.  If the Button?Motion masks had been used in requesting mouse
    motion events, a stream of mouse motion data may suddenly stop without
    any indication that a button had been released.  Etc.
    
#define EnterNotify		7
#define LeaveNotify		8
    
    Indication of mouse travel through the window hierarchy.  With
    unreliable delivery, any part of the traversal may be dropped so that
    there will be no indication that the mouse passed out of, into, or
    through some number of windows.  This may confuse some applications.
    
#define FocusIn			9
#define FocusOut		10
    
    Indication of change of input focus to some windows.  Also traverses
    hierarchy much like enter/leave notify so same caveats apply.
    
#define KeymapNotify		11
    
    Report of state of keymap.  Currently, when requested, it is sent after
    every enternotify and focusin event and a client can rely on this. 
    With unreliable delivery, this event may be lost or the preceeding
    focusin and enternotify may be lost thus creating an unexpected event.
    
#define Expose			12
#define GraphicsExpose		13
#define NoExpose		14
    
    Previously discussed.  Has a "reliable" count down field for contiguous
    events.  This no longer works with unreliable delivery.
    
#define VisibilityNotify	15
    
    Sent to a client after hierarchy change operations.  If lost, client
    may not know that a part of the display is now visible.
    
#define CreateNotify		16
#define DestroyNotify		17
#define UnmapNotify		18
#define MapNotify		19
#define MapRequest		20
#define ReparentNotify		21
#define ConfigureNotify		22
#define ConfigureRequest	23
#define GravityNotify		24
#define ResizeRequest		25
#define CirculateNotify		26
#define CirculateRequest	27
#define PropertyNotify		28
    
    *IMPORTANT* See the protocol specification.  Used by window managers to
    intercept application requests for hierarchy changes, etc.  If these
    are lost, the window manager will *REALLY* be confused.  How are these
    recovered?
    
#define SelectionClear		29
#define SelectionRequest	30
#define SelectionNotify		31
    
    Selection events.  Loss of these may mean that several clients think
    that they own a selection, or may cause other problems.
    
#define ColormapNotify		32
    
    Notification that a colormap has been changed.  Window managers and
    clients are interested in this.  Loss of this event *will* prevent
    colormap install oscillations. hahahahaha.
    
#define ClientMessage		33
    
    Generic information from one client to another.  Also used to "wakeup"
    toolkit from AST level.  Since this information cannot be recovered by
    a request, who should receive an error if this can't be sent?  The
    recipient or the sender?  Should this be made to execute like replies?
    
#define MappingNotify		34
    
    Report that a modifier, keyboard, or pointer mapping request was
    executed.  Loss of event means that a client may use the wrong mapping
    when it again receives input events.
    
    
    There is more to flow control than just dropping data and repainting 
    windows later.  THIS IS A BIG PROBLEM.
    
    						monty


60.91PSW::WINALSKIPaul S. WinalskiThu Feb 23 1989 16:5322
RE: .92

I agree, it's a big problem.  It's far too big a problem for the server to
arbitrarily decide for a client whether or not the situation is recoverable.
If a client receives a LostEvents event, it knows which events it had enabled
reporting for, and therefore what recovery actions have to be taken (if indeed
any are possible).

Receipt of a LostEvents event is an error condition.  Any client is well within
its rights to treat receipt of this event as unrecoverable and abort the link.
For example, the window manager probably would abort upon receipt of
LostEvents, since the event that was lost might be CreateNotify, DestroyNotify,
or one of the other events that you cited.  On the other hand, I have written
several applications that listen only to a small number of events and don't
really care if they miss one or more of them--if they are told that events were
lost, they can query the server as to the present situation, or for some events
(exposure, for example) they can assume the worst case and do recovery.  Why
should these sorts of applications get terminated unconditionally by the
server?

--PSW

60.92Just say WAITCVG::PETTENGILLmulpThu Feb 23 1989 20:5328
One solution would be to have the server clear the screen and display a big
WAIT whenever it became blocked trying to send to a client.  However, that
might lead to a deadlock, or at least a situation where the user must wait a
long time for things to free up, so the server would need to watch for multiple
^Y's so that it could ask `Are you pounding on ^Y to abort the client?'

Seriously, I'm mostly kidding above.  But now I'm not.

No scheme can prevent data arriving faster than it can be sent out and with a
user involved, you can't `flow control' a user so you are always going to be
faced with the possibility of data overrun.  Therefore, it will be necessary
to discard data one way or another and somehow notify the only thing that can
deal with the problem in an intelligent fashion, the user.  Currently this is
done by waiting for a while and then killing the connection and discarding
all the related data (and probably discarding some or all the data that the
user supplied while waiting) and then when the application clean up is done
the user is notified by the absence of his application and possible gets an
error message.

The proposal to send a `lost event' event is a compatible extension.  If the
application can't deal with the problem at all, or only sometimes, then the behavior
is no different than today.  If, on the other hand, it can recover, then it is
a big improvement.  Note that the WM can recover totally, although the user
might notice the recovery.  If you don't believe me, just stop the process and
then run it again.  Everything will return to the way it was.  Maybe it's not
the best that one could ask for, but it is better than not allowing the user to
continue at all.

60.93Window manager can't really recover...DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Fri Feb 24 1989 12:2815
Just a nit about .94:  The window manager can't recover completely from the
situation that PSW mentioned:  losing a MapRequest.  In this case, the
client which issued the Map will just sit around forever thinking that it
got mapped, but not really being mapped.  When the window manager makes its
"recovery" scan to figure out which windows to work on, it will never deal with
the "hanging" window, because it will assume that the client has not requested
that it be mapped yet.

However, having the window manager abort in this case does not help either.

This is a good example of the dilemmas faced when you try to break the
"reliable byte stream" assumption, though.

Burns

60.94PSW::WINALSKIPaul S. WinalskiFri Feb 24 1989 14:0711
The point is that we DON'T have a reliable byte stream today.  Should the window
manager or any other client get behind in processing events for any of a number
of reasons, the server will abort the connection and discard the queued
events.  The only thing that a LostEvents event does is let the client decide
whether to abort the connection instead of the server.  If the LostEvents
feature is left disabled by default and enabled by an explicit XSetFlowControl()
type call from the client, then the change is completely upward compatible.

--PSW


60.95KONING::KONINGNI1D @FN42eqMon Feb 27 1989 15:4426
Not only do you not have a reliable bytestream now, you never did, and you
never will.  Incidentally, the comment in .74 about "...transports...cannot
be made reliable on VMS" misses a key point: this whole discussion has
NOTHING to do with VMS, it has to do with fundamental and, I would have
thought, well known properties of distributed systems no matter what OS
they are built on.

I can see from the analysis in .92 that, for some applications, recovery
from EventsLost is harder than just repainting the windows.  For many it
will not be, though.  And clearly every one of them always has the option
to declare these to be fatal errors, in which case the situation is no worse
than it is now -- other than being that way by design rather than by
omission.

Something to consider: currently applications die when this problem occurs
because the connection terminates (and they don't do Ed Gardner's "devil's
recovery").  If one were to change X by simply adding this event, without
adding the enabling stuff that PSW proposed, then many applications would
still abort since they don't recognize the event.  Is that incompatible?

Then again, I probably can't get away with bitching about the X definition
of "reliable transport" and at the same time proposing this definition of
"compatible".   :-)

	paul

60.96More Fat for the FireSTAR::BRANDENBERGIntelligence - just a good party trick?Mon Feb 27 1989 16:2232
    
    Here are some things to ponder (not directed at any reply in
    particular).
    
    o  If a "reliable server" is a requirement for some users in some
    applications, then do we need to provide a mechanism by which this user
    can enforce this policy?  The policy might state that a client either
    accepts an eventsLost event or takes a quick disconnect on transport
    jam.  Or, it might state that *only* eventsLost clients are accepted.
    If the latter, then sensing the type of client should happen at an
    early stage of a connection, say, when the client transmits its
    protocol level.
    
    o  I've argued that eventsLost processing is basically connection-wide
    and not associated with individual resources.  Hence, one part of an
    application will unilaterally decide for the entire application what mode
    it will run in.  This isn't a problem for code that is newly written or
    that will be reworked for a new version of DECwindows.  But we already have
    a V1.0 product and, I assume, some sort of support commitment.  If so,
    it would be incorrect for a new toolkit library to enable lostEvents
    for an old application or for a new, reliable application to rely on an
    old toolkit library.  How does the left hand keep up with what the
    right hand is doing?
    
    o  The default event processing is not upward compatible with the new
    lostEvents event.  Most event processing will simply consume any
    unrecognized events because there already exist events which cannot be
    masked by XSelectInput so an application must expect the unexpected (to
    some degree).
    
    						monty
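
    For reference, this is the dispatch pattern the last point refers to --
    nothing hypothetical in it, just the usual Xlib loop.  An event type the
    switch doesn't know about falls through the default case and is quietly
    dropped, and clients already have to cope with nonmaskable events such as
    MappingNotify.

        #include <X11/Xlib.h>

        void event_loop(Display *dpy)
        {
            XEvent ev;

            for (;;) {
                XNextEvent(dpy, &ev);
                switch (ev.type) {
                case Expose:
                    /* ... redraw the damaged area ... */
                    break;
                case MappingNotify:                    /* nonmaskable */
                    XRefreshKeyboardMapping(&ev.xmapping);
                    break;
                default:
                    break;                     /* unrecognized: dropped */
                }
            }
        }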

60.97KONING::KONINGNI1D @FN42eqMon Feb 27 1989 17:0629
I might want to have a policy that all the applications I use must support
EventsLost.  But I don't see a way to enforce that in the way you
describe; that merely says that the application does something, but what
it does isn't necessarily sane.  Essentially this requirement is one of
those of the form "The application must have high quality".  This sort of
requirement has the Felix Frankfurter property "I know it when I see it".

My conclusion:
a. EventsLost should be added in the next version of DECwindows.
b. Handling of that event should be added in the next (or current, depending
   on planned release date) version of every DEC product, and in particular to
   all widgets.
c. Enabling of the sending of EventsLost (per PSW) is the job of the main
   program (via an Xmumble or XtMumble call).  Widgets don't do this.
   Our applications do, of course, as soon as they have been fixed to handle
   the event.

My guess would be that most of the work is in the widgets; the changes to
support the new event would be minor for most applications (though not for
all, obviously).  So by doing the 3 steps I mention, we create the message:

1. Our applications are now more robust than before (and indeed more so than
   any others in the industry).
2. Your applications can be, too, with -- usually -- a small amount of effort.
   Just use the new widget library, which, of course, is upward compatible,
   work out the recovery you need, and issue the enabling call.

	paul
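
A sketch of the division of labor in (b) and (c).  XtAddEventHandler with a
zero mask and nonmaskable=True is the real Xt entry point for events that
XSelectInput can't select; the enabling call is still the hypothetical
"Xmumble".

    #include <X11/Intrinsic.h>

    /* Widget side: recovery belongs to the widget -- re-fetch whatever
     * state it mirrors and redisplay its own window.                   */
    static void lost_events_handler(Widget w, XtPointer client_data,
                                    XEvent *ev, Boolean *continue_dispatch)
    {
        /* ... widget-specific recovery ... */
    }

    /* Application side: done once by the main program, not by widgets. */
    void enable_lost_events(Widget toplevel)
    {
        XtAddEventHandler(toplevel, (EventMask)0, True,
                          lost_events_handler, NULL);

        /* The hypothetical opt-in call from .94/.97:
         *
         *     XSetFlowControl(XtDisplay(toplevel), FlowControlNotifyOnLoss);
         */
    }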

60.98Calvin & Hobbes Engineering Inc.STAR::BRANDENBERGIntelligence - just a good party trick?Tue Mar 07 1989 10:3624
    
    I suppose my perception of a need for enforcing a policy comes from
    notes in other conferences with references to "mission critical" X
    applications.  Specifically, both NASA and ESA are looking at using X
    in their manned space programs with ESA actually using it for spacecraft
    instrumentation.  Since Joe Astronaut probably isn't going to start an
    Xtrek session during a flight, perhaps I'm overreacting, but it was the
    possibility of applications such as these, and a rather cavalier attitude
    about what constitutes sufficient testing in a system exhibiting stochastic
    behaviour, that prompted my original tirade around reply .28.
    
    There was a Calvin & Hobbes cartoon some years ago that went like this:
    
    Calvin:  Dad, how do they get the load limit for bridges?
    Dad:  Well, Calvin, they drive bigger and bigger trucks over it until
    		it breaks.  Then they rebuild the bridge and weigh the
    		last truck.
    Calvin:  Oh!  I should have guessed that!
    
    Unfortunately, this is exactly how software load limits are determined
    today.
    
    						monty

60.99Is anyone actually going to do anything?WINERY::ROSETue Mar 14 1989 12:377
    This is an interesting discussion. Sorry I missed most of it while on
    vacation... But is anyone taking an action item to try and get
    EventsLost added to the X protocol (for example, in the ANSI
    standardization process)?
    
    Re .97: By Felix Frankfurter I think you mean Potter Stewart.

60.100let the server do it!NEXUS::B_WACKERFri Mar 24 1989 10:4953
Since xlostevent is so fraught with problems and so unlikely to make 
it past MIT, how about another approach?  Use the terminal driver model 
(.88) so the session manager knows of the problem before it is too 
late.  Create a modal message in the offending process's window that 
says something like "This process is a hog and you can either wait 
till it's eaten its fill or push the kill button (in the box) if you 
want to get rid of it."  Do all the previously suggested beeping, 
freezing of the keyboard, skull and crossbones, etc. to tell the user that 
no other input will be accepted for this window other than the kill 
button.

What about all the other windows?

In a first version you could just stall them, too.  Send a message to the 
clients to block until they get a message that the hog is satisfied or has 
been consciously killed by the user.

In a second version, you could make the server smart enough to watch for 
actions that generate messages to the hog.  If that happens, then stall 
the initiator and give it a box that says "waiting for the hog, kill 
or wait."  That way, if there's no cross-process communication going on, 
the hog is the only one to suffer.  You can still have real-time 
graphics output to the thermometer for your nuclear coolant!  You 
could still move a window that is partially obscured by the hog out 
from under it and do virtually everything where there's no geometry 
interaction with the hog.

Advantages I see:
1) The only new protocol is the stall to the client (which may already 
be there for sync??)

2) The user (not the application) is in control of whether or not the 
connection is terminated.

3) All the implementation is in the server so upward compatibility is
guaranteed.

4) A session manager option could enable this functionality or 
the current "tough s__t" approach.

5) It could be a very important differentiation feature between our 
server and other vendors' if MIT drags their heels.

6) You completely avoid the impossible problem of how to recover from 
lost events because you don't lose any.  (There really is no general 
solution to this problem, is there?)

7) The user is in control.

8) The user is in control.

Bruce
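
To make the second-version policy concrete, here is the shape of the decision
the server would make, in pseudo-C.  Every name below is hypothetical -- this
is not server source, just the policy from this reply written out.

    #include <X11/Xlib.h>

    typedef struct Client Client;                 /* opaque, hypothetical */

    extern int  output_blocked(Client *c);        /* transport jammed?    */
    extern void post_wait_or_kill_box(Client *c); /* let the user decide  */
    extern void stall(Client *initiator);
    extern void deliver(Client *c, XEvent *ev);

    /* Route an event to `dest'; `initiator' is the client whose action
     * caused it, or NULL for plain user input.                          */
    void route_event(Client *dest, Client *initiator, XEvent *ev)
    {
        if (!output_blocked(dest)) {
            deliver(dest, ev);                    /* normal case          */
            return;
        }
        post_wait_or_kill_box(dest);              /* mark the hog         */
        if (initiator != NULL)
            stall(initiator);                     /* only clients that talk
                                                     to the hog are held  */
    }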

60.101KONING::KONINGNI1D @FN42eqFri Mar 24 1989 11:1715
I don't see how that can work.  Some events are indeed caused by user input,
and you could perhaps block those.  But a lot of other events come from the
actions of other clients -- for example, expose events occur if another
window is moved, resized, deleted, or iconized.  You can't block those 
operations because then you would affect other clients.  (If you think it's
ok to affect other clients, you might as well just halt the system when this
problem occurs.)  If you can't prevent other clients from doing the things
that generate these events, then the only alternative, given that you
have no place to store the events, is to discard them and let the affected
client know that this happened.

What's the big deal?  This is elementary stuff in distributed systems design.

	paul

60.102DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Fri Mar 24 1989 13:1211
Personally, I think the first thing to do is to reduce (but not eliminate)
the problem by giving the server the capability of saving the event away
and trying again later while continuing to process requests.  Obviously this
can't go on forever; the server runs out of memory if a client sits
around idle long enough.  Still, it goes a long way toward alleviating
the short-term problem.  In the long term, I think you have to tell a client
that it has lost something, but we want to minimize the frequency with which
we have to do this.

Burns
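
A sketch of the "save it away and retry" idea with the obvious memory cap.
The types and the limit are made up; the point is only that the queue is
bounded, and hitting the bound is where a LostEvents event (or today's
disconnect) would still be needed.

    #include <stdlib.h>
    #include <string.h>

    #define PENDING_LIMIT (64 * 1024)        /* arbitrary per-client cap */

    typedef struct {
        char   *buf;                         /* wire-format events        */
        size_t  used;
    } PendingQueue;

    /* 0 = buffered for a later retry; -1 = quota exhausted, fall back
     * to something stronger.                                            */
    int queue_event(PendingQueue *q, const void *wire, size_t len)
    {
        char *p;

        if (q->used + len > PENDING_LIMIT)
            return -1;
        p = realloc(q->buf, q->used + len);
        if (p == NULL)
            return -1;
        memcpy(p + q->used, wire, len);
        q->buf  = p;
        q->used += len;
        return 0;
    }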

60.103let the user decideNEXUS::B_WACKERFri Mar 24 1989 15:3320
>(If you think it's ok to affect other clients, you might as well just
>halt the system when this problem occurs.)  If you can't prevent other
>clients from doing the things that generate these events, then the
>only alternative, given that you have no place to store the events, is
>to discard them and let the affected client know that this happened. 

You only affect the clients that are muddying the waters of the one 
who's run out of resources.  True that could escalate, but the USER
could still abort if it is the "wrong thing".  A bad apple in the 
barrel affects everyone sooner or later.

>What's the big deal?  This is elementary stuff in distributed systems
>design.

Doesn't that imply a design where the server is capable of restoring 
all the lost context, or one where the client has a copy of the server 
database?  Neither of those obtains here.

Bruce

60.104KONING::KONINGNI1D @FN42eqMon Mar 27 1989 13:1810
Resource issues caused by the fact that process P is not running as fast as
it needs to should be confined to process P, and should not affect other
processes.  That's what I was pushing for.  The fact that the other processes
are doing things for the same user is irrelevant.

re .102: I agree, reducing the incidence of the problem is a good first step
while we wait for the real solution, if and when it actually comes to pass.

	paul

60.105Why are we still arguing this?IO::MCCARTNEYJames T. McCartney III - DTN 381-2244 ZK02-2/N24Mon Mar 27 1989 20:0914
>>> A bad apple in the barrel affects everyone sooner or later.

If my application that went off compute-bound were some critical life-support 
or fail-safe control mechanism, I'd sure hate for my display server to decide
that it "wasn't playing by the rules" and disconnect it.  The point is really 
simple: you can't stop all the different sources from which events can be 
generated, you can only hope to catch them all.  When you can't, you must do
something reasonable.  The lost-events event is the classical way to handle this
type of flow-control problem.  It's not perfect, but at least the failure modes
are such that you can recover. 

James

60.106CVG::PETTENGILLmulpTue Mar 28 1989 15:456
Here's a variation on the flow control problem.  Try xlsfonts with the full
info option and watch as the server `hangs'.  As the `man page' for it says
under `bugs', this is a problem with the single-threaded server design.

ELKTRA::DW_EXAMPLES note 116
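
If you'd rather see it from a program than from xlsfonts, the request behind
the full-info listing is XListFontsWithInfo (a real Xlib call).  One request
that makes the server walk every matching font; while the single-threaded
server grinds through that, nobody else gets service.

    #include <stdio.h>
    #include <X11/Xlib.h>

    void list_all_fonts(Display *dpy)
    {
        int          count;
        XFontStruct *info;
        char       **names = XListFontsWithInfo(dpy, "*", 10000,
                                                 &count, &info);
        int          i;

        if (names == NULL)
            return;
        for (i = 0; i < count; i++)
            printf("%s\n", names[i]);
        XFreeFontInfo(names, info, count);
    }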

60.107ULTRA::WRAYJohn Wray, Secure Systems DevelopmentWed Jul 26 1989 15:593
    Any news on this issue?  Are the MIT people looking at it, or have they
    defined it to be a non-problem?

60.108DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Thu Jul 27 1989 13:2411
I talked to Bob Scheifler about it.  He believes it is a non-problem.  Monty
Brandenberg was going to make a proposal for fixing it.  However, he decided
to leave the company and take up consulting before he could get to it.

Version 2 of DECwindows relieves the problem to a large extent by doing
additional buffering with DECnet.  As has been discussed before, this does
not truly solve the problem, but it does reduce the cases where we see it.
In fact, I have not seen it at all since this happened.

Burns

60.109ULTRA::WRAYJohn Wray, Secure Systems DevelopmentSat Feb 03 1990 16:0518
    I don't understand how he can view it as a non-problem.  Without
    application-level flow-control (and lost-event handling) of some sort,
    I can write a non-privileged application which can cause other
    applications sharing the same display server to crash.  Bugs in one
    application can cause other applications to crash.  Glitches on the
    network which tear down the transport connection can cause applications
    to crash.  I've just demonstrated that a user with a quick mouse finger
    can kill random applications (although it is true that this is more
    difficult now than it was under VMS DECwindows V1).
    
    All this boils down to "X, as defined at present, is inherently
    unreliable", which seems to mean that it is unsuitable for most
    process-control applications.
    
    Or am I missing something?
    
    Is there any record of Monty's proposed fix?  Is it being followed up
    by anyone else within Digital?
60.110A voice from the past coming back to haunt? DECWIN::FISHERBurns Fisher 381-1466, ZKO3-4/W23Mon Feb 05 1990 13:015
He never wrote anything down.  No, it is not being pursued.  It is very hard
to pursue a theoretical problem when there are millions of problems that
customers see (and complain about) every day which are waiting to be solved.

I agree...it is not fixed or solved.  However...